In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab_5.ipynb")

# Lab 5: Functions and Visualizations

Welcome to Lab 5! This week, we'll learn about functions, table methods such as `apply`, and how to generate visualizations! 

Recommended Reading:

* [Applying a Function to a Column](https://www.inferentialthinking.com/chapters/08/1/applying-a-function-to-a-column.html)
* [Visualizations](https://www.inferentialthinking.com/chapters/07/visualization.html)

Recommended Videos:
* Intro to Functions
* Applying Functions to Columns
* Grouping by One Column
* Intro to Lists
* Grouping by Two Columns & Pivot Tables
* Intro to Joins

First, set up the notebook by running the cell below.

In [1]:
import numpy as np
from datascience import *

# These lines set up graphing capabilities.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets


**Submission**: Once you're finished, select "Save and Checkpoint" in the File menu and then execute the submit cell at the end. The result will contain a link that you can use to check that your assignment has been submitted successfully. 

# 1. Defining functions

Let's write a very simple function that converts a proportion to a percentage by multiplying it by 100.  For example, the value of `to_percentage(.5)` should be the number 50 (no percent sign).

A function definition has a few parts.

##### `def`
It always starts with `def` (short for **def**ine):

    def

##### Name
Next comes the name of the function.  Like other names we've defined, it can't start with a number or contain spaces. Let's call our function `to_percentage`:
    
    def to_percentage

##### Signature
Next comes something called the *signature* of the function.  This tells Python how many arguments your function should have, and what names you'll use to refer to those arguments in the function's code.  A function can have any number of arguments (including 0!). 

`to_percentage` should take one argument, and we'll call that argument `proportion` since it should be a proportion.

    def to_percentage(proportion)
    
If we want our function to take more than one argument, we add a comma between each argument name. Note that if we had zero arguments, we'd still place the parentheses `()` after the name.

We put a colon after the signature to tell Python it's over. If you're getting a syntax error after defining a function, check to make sure you remembered the colon!

    def to_percentage(proportion):

##### Documentation
Functions can do complicated things, so you should write an explanation of what your function does.  For small functions, this is less important, but it's a good habit to learn from the start.  Conventionally, Python functions are documented by writing an **indented** triple-quoted string:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
    
    
##### Body
Now we start writing code that runs when the function is called.  This is called the *body* of the function and every line **must be indented with a tab**.  Any line that is *not* indented, and is instead left-aligned with the `def` statement, is considered outside the function.

Some notes about the body of the function:
- We can write code that we would write anywhere else.  
- We use the arguments defined in the function signature. We can do this because we assume that when we call the function, values are already assigned to those arguments.
- We generally avoid referencing variables defined *outside* the function. If you would like to reference variables outside of the function, pass them through as arguments!


Now, let's give a name to the number we multiply a proportion by to get a percentage:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100

##### `return`
The special instruction `return` is part of the function's body and tells Python to make the value of the function call equal to whatever comes right after `return`.  We want the value of `to_percentage(.5)` to be the proportion .5 times the factor 100, so we write:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
        return proportion * factor
        
`return` only makes sense in the context of a function, and **can never be used outside of a function**. `return` is usually the last line of a function's body because Python stops executing the body as soon as it reaches a `return` statement.
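
To see that Python really does stop at `return`, here is a minimal sketch. The function `double` is made up for this illustration and is not part of the lab; the line after `return` is deliberately unreachable.

```python
def double(x):
    """Returns twice x. (Hypothetical example, not part of the lab.)"""
    return 2 * x
    print("This line never runs.")  # Unreachable: the function has already returned.

result = double(4)  # The value of the call is whatever followed `return`.
```

Running this should leave `result` equal to `8` and display nothing, since the `print` line is never reached.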

*Note:*  `return` inside a function tells Python what value the function evaluates to. However, there are other functions, like `print`, that have no `return` value. For example, `print` simply prints a certain value out to the console. 

`return` and `print` are **very** different. 

<span style="color:blue">**Question 1.0.1**</span> Define `to_percentage` in the cell below.  Call your function to convert the proportion .2 to a percentage.  Name that percentage `twenty_percent`.

<!--
BEGIN QUESTION
name: q11
-->

In [2]:
def to_percentage(proportion): 
    """Converts a proportion to a percentage."""
    ...

twenty_percent = ...
twenty_percent

In [None]:
grader.check("q11")

Like you’ve done with built-in functions in previous labs (max, abs, etc.), you can pass in named values, otherwise known as variables, as arguments to your function.
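
As a small sketch of passing named values to a function, the example below uses a made-up two-argument function, `rectangle_area`; it and the variable names are assumptions for illustration only.

```python
def rectangle_area(width, height):
    """Returns the area of a rectangle. (Hypothetical example.)"""
    return width * height

table_width = 3
table_height = 1.5

# The names are evaluated first, so this is the same as rectangle_area(3, 1.5).
table_area = rectangle_area(table_width, table_height)
```

Notice that the names used in the call (`table_width`, `table_height`) don't have to match the argument names in the signature (`width`, `height`).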

<span style="color:blue">**Question 1.0.2**</span> Use `to_percentage` again to convert the proportion named `a_proportion` (defined below) to a percentage called `a_percentage`.

*Note:* You don't need to define `to_percentage` again!  Like other named values, functions stick around after you define them.

<!--
BEGIN QUESTION
name: q12
-->

In [4]:
a_proportion = 2**(.5) / 2
a_percentage = ...
a_percentage

In [None]:
grader.check("q12")

Here's something important about functions: the names assigned *within* a function body are only accessible within the function body. Once the function has returned, those names are gone.  So even if you created a variable called `factor` and defined `factor = 100` inside of the body of the `to_percentage` function and then called `to_percentage`, `factor` would not have a value assigned to it outside of the body of `to_percentage`:

In [1]:
# You should see an error when you run this.  (If you don't, you might
# have defined factor somewhere above.)
factor

As we've seen with built-in functions, functions can also take strings (or arrays, or tables) as arguments, and they can return those things, too.

<span style="color:blue">**Question 1.0.3**</span> Define a function called `disemvowel`.  It should take a single string as its argument.  (You can call that argument whatever you want.)  It should return a copy of that string, but with all the characters that are vowels removed.  (In English, the vowels are the characters "a", "e", "i", "o", and "u".) You can use as many lines inside of the function to do this as you’d like.

*Hint:* To remove all the "a"s from a string, you can use `that_string.replace("a", "")`.  The `.replace` method for strings returns a new string, so you can call `replace` multiple times, one after the other. 
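
Here is a sketch of chaining `replace` calls on an unrelated problem: stripping punctuation from a phone number. The string and names are invented for this example.

```python
phone = "(555) 010-9999"

# Each replace returns a new string, so the next replace can be called on the result.
digits_only = phone.replace("(", "").replace(")", "").replace(" ", "").replace("-", "")

# The original string is left unchanged.
original_still_intact = phone == "(555) 010-9999"
```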

<!--
BEGIN QUESTION
name: q13
-->

In [7]:
def disemvowel(a_string):
    """Removes all vowels from a string.""" 
    ...

# An example call to your function.  (It's often helpful to run
# an example call from time to time while you're writing a function,
# to see how it currently works.)
disemvowel("Can you read this without vowels?")

In [None]:
grader.check("q13")

##### Calls on calls on calls
Just as you write a series of lines to build up a complex computation, it's useful to define a series of small functions that build on each other.  Since you can write any code inside a function's body, you can call other functions you've written.

If a function is like a recipe, defining a function in terms of other functions is like having a recipe for cake telling you to follow another recipe to make the frosting, and another to make the jam filling.  This makes the cake recipe shorter and clearer, and it avoids having a bunch of duplicated frosting recipes.  It's a foundation of productive programming.
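
A minimal sketch of one function building on another is shown here. Both `square` and `sum_of_squares` are hypothetical names chosen for illustration.

```python
def square(x):
    """Returns x multiplied by itself. (Hypothetical example.)"""
    return x * x

def sum_of_squares(a, b):
    """Returns a squared plus b squared, reusing square for each part."""
    return square(a) + square(b)

total = sum_of_squares(3, 4)  # square(3) + square(4)
```

If the definition of `square` ever needed to change, it would only have to change in one place.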

For example, suppose you want to count the number of characters *that aren't vowels* in a piece of text.  One way to do that is to remove all the vowels and count the size of the remaining string.

<span style="color:blue">**Question 1.0.4**</span> Write a function called `num_non_vowels`.  It should take a string as its argument and return a number.  That number should be the number of characters in the argument string that aren't vowels. You should use the `disemvowel` function you wrote above inside of the `num_non_vowels` function.

*Hint:* The function `len` takes a string as its argument and returns the number of characters in it.

<!--
BEGIN QUESTION
name: q14
-->

In [9]:
def num_non_vowels(a_string):
    """The number of characters in a string, minus the vowels."""
    ...

# Try calling your function yourself to make sure the output is what
# you expect. You can also use the interact function in the next cell if you'd like.

In [None]:
grader.check("q14")

Functions can also encapsulate code that *displays output* instead of computing a value. For example, if you call `print` inside a function, and then call that function, something will get printed.

The `movies_by_year` dataset in the textbook has information about movie sales in recent years.  Suppose you'd like to display the year with the 5th-highest total gross movie sales, printed in a human-readable way.  You might do this:

In [11]:
movies_by_year = Table.read_table("movies_by_year.csv")
rank = 5
fifth_from_top_movie_year = movies_by_year.sort("Total Gross", descending=True).column("Year").item(rank-1)
print("Year number", rank, "for total gross movie sales was:", fifth_from_top_movie_year)

After writing this, you realize you also wanted to print out the 2nd and 3rd-highest years.  Instead of copying your code, you decide to put it in a function.  Since the rank varies, you make that an argument to your function.

<span style="color:blue">**Question 1.0.5**</span> Write a function called `print_kth_top_movie_year`.  It should take a single argument, the rank of the year (like 2, 3, or 5 in the above examples).  It should print out a message like the one above.  

*Note:* Your function shouldn't have a `return` statement.

<!--
BEGIN QUESTION
name: q15
-->

In [12]:
...
print(...)

...

# Example calls to your function:
print_kth_top_movie_year(2)
print_kth_top_movie_year(3)

In [None]:
grader.check("q15")

### `print` is not the same as `return`
The `print_kth_top_movie_year(k)` function prints the year with the `k`th-highest total gross movie sales! However, since we did not return any value in this function, we cannot use that year after we call it. Let's look at an example of another function that prints a value but does not return it.

In [15]:
def print_number_five():
    print(5)

In [16]:
print_number_five()

However, if we try to use the output of `print_number_five()`, we see that the value `5` is printed but we get a TypeError when we try to add the number 2 to it!

In [17]:
print_number_five_output = print_number_five()
print_number_five_output + 2

It may seem that `print_number_five()` is returning a value, 5. In reality, it just displays the number 5 to you without giving you the actual value! If your function prints out a value without returning it and you try to use that value, you will run into errors, so be careful!
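
One way to see what is going on: a function with no `return` statement evaluates to the special value `None`, and `None` can't be added to a number. The sketch below uses a made-up function, `print_greeting`, to check this.

```python
def print_greeting():
    """Prints a greeting but has no return statement. (Hypothetical example.)"""
    print("hello")

greeting_output = print_greeting()         # "hello" is displayed when this line runs.
output_is_none = greeting_output is None   # The call itself evaluated to None.
```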

Explain to your neighbor how you might add a line of code to the `print_number_five` function (after `print(5)`) so that the code `print_number_five_output + 5` would result in the value `10`, rather than an error.

# 2. Functions and CEO Incomes

In this question, we'll look at the 2019 compensation of Canada's highest paid CEOs. The data was compiled from the Canadian Centre for Policy Alternatives, in a report titled [The Golden Cushion](https://www.policyalternatives.ca/sites/default/files/uploads/publications/National%20Office/2021/01/Golden%20cushion.pdf). 

We've copied the data from the PDF report into a file called `salaries.csv`. Please note that all salaries are in CAD amounts.

In [18]:
salaries = Table.read_table('salaries.csv')
salaries

Let's say we want to compute the average of the CEOs' pension value. Try running the cell below.

In [19]:
np.average(salaries.column("Pension Value"))


You should see a TypeError. Let's examine why this error occurred by looking at the values in the `Pension Value` column. 

<span style="color:blue">**Question 2.0.1**</span> Use the `type` function and set `pension_value_type` to the type of the third value in the "Pension Value" column -- since that's our first row with a pension value recorded.

<!--
BEGIN QUESTION
name: q21
-->

In [20]:
pension_value_type = ...
pension_value_type

In [None]:
grader.check("q21")

<span style="color:blue">**Question 2.0.2**</span> You should have found that the values in the `Pension Value` column are strings. It doesn't make sense to take the average of string values, so we need to convert them to numbers if we want to do this. Extract D. Mark Bristow's pension value in 2019, in dollars.  Call it `mark_bristow_pension_string`.

<!--
BEGIN QUESTION
name: q22
-->

In [24]:
mark_bristow_pension_string = ...
mark_bristow_pension_string

In [None]:
grader.check("q22")

<span style="color:blue">**Question 2.0.3**</span> Convert `mark_bristow_pension_string` to a number of *dollars*. 

Some hints, as this question requires multiple steps:
- The string method `strip` will be useful for removing the dollar sign; it removes a specified character from the start or end of a string.  For example, the value of `"100%".strip("%")` is the string `"100"`.  
- You'll also need the function `float`, which converts a string that looks like a number to an actual number.  
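
As a quick sketch of the two hints working together, here they are applied to a percentage string rather than a pay string. The names below are invented for this example.

```python
percent_string = "45%"

without_sign = percent_string.strip("%")  # The string "45".
as_number = float(without_sign)           # The number 45.0.

# Arithmetic now works, which it would not on the original string.
half = as_number / 2
```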

<!--
BEGIN QUESTION
name: q23
-->

In [26]:
mark_bristow_pension = ...
mark_bristow_pension

In [None]:
grader.check("q23")

To compute the average pay, we need to do this for every CEO.  But that looks like it would involve copying this code once for every row of the table.

This is where functions come in.  First, we'll define a new function, giving a name to the expression that converts "total pay" strings to numeric values.  Later in this lab, we'll see the payoff: we can call that function on every pay string in the dataset at once.

Section 1 of this lab explained how to define a function. For now, just fill in the ellipses in the cell below.

<span style="color:blue">**Question 2.0.4**</span> Copy the expression you used to compute `mark_bristow_pension`, and use it as the return expression of the function below. But make sure you replace the specific `mark_bristow_pension_string` with the generic `pay_string` name specified in the first line of the `def` statement. From there, we'd also like you to go a little further: this dataset uses a '-' when data is unavailable, but we'd still like to make calculations on an entire column, so we'd like this dash to become a 0, which is a workable number. Tack on a `.replace` call to accomplish this as well.

*Hint*: When dealing with functions, you should generally not be referencing any variable outside of the function. Usually, you want to be working with the arguments that are passed into it, such as `pay_string` for this function. If you're using `mark_bristow_pension_string` within your function, you're referencing an outside variable!

<!--
BEGIN QUESTION
name: q24
-->

In [29]:
def convert_pay_string_to_number(pay_string):
    """Converts a pay string like '$100' to a number of
    dollars."""
    ...

In [None]:
grader.check("q24")

Running that cell doesn't convert any particular pay string. Instead, it creates a function called `convert_pay_string_to_number` that can convert *any* string with the right format to a number of dollars.

We can call our function just like we call the built-in functions we've seen. It takes one argument -- a string -- and it returns a float.

In [32]:
convert_pay_string_to_number('$42')

In [33]:
convert_pay_string_to_number(mark_bristow_pension_string)

In [34]:
# We can also compute, say, Brian J. Porter's pay in the same way:
convert_pay_string_to_number(salaries.where("Name", are.containing("Brian")).column("Total Compensation 2019").item(0))

So, what have we gained by defining the `convert_pay_string_to_number` function? 
Well, without it, we'd have to copy the code `float(some_pay_string.strip("$"))` each time we wanted to convert a pay string.  Now we just call a function whose name says exactly what it's doing.

# 3. `apply`ing functions

Defining a function is a lot like giving a name to a value with `=`.  In fact, a function is a value just like the number 1 or the text "data"!

For example, we can make a new name for the built-in function `max` if we want:

In [35]:
our_name_for_max = max
our_name_for_max(2, 6)

The old name for `max` is still around:

In [36]:
max(2, 6)

Try just writing `max` or `our_name_for_max` (or the name of any other function) in a cell, and run that cell.  Python will print out a (very brief) description of the function.

In [37]:
max

Now try writing `?max` or `?our_name_for_max` (or the name of any other function) in a cell, and run that cell.  An information box should show up at the bottom of your screen with a longer description of the function.

*Note: You can also press Shift+Tab after clicking on a name to see similar information!*

In [38]:
?our_name_for_max

Let's look at what happens when we set `max` to a non-function value. You'll notice that a TypeError will occur when you try calling `max`. Things like integers and strings are not callable. Look out for any function names that might have been reassigned when you encounter this type of error.

In [39]:
max = 6
max(2, 6)

In [40]:
# This cell resets max to the built-in function. Just run this cell, don't change its contents
import builtins
max = builtins.max

Why is this useful?  Since functions are just values, it's possible to pass them as arguments to other functions.  Here's a simple but not-so-practical example: we can make an array of functions.

In [41]:
make_array(max, np.average, are.equal_to)

<span style="color:blue">**Question 3.0.1**</span> Make an array containing any 3 other functions you've seen.  Call it `some_functions`.

<!--
BEGIN QUESTION
name: q31
-->

In [42]:
some_functions = ...
some_functions

In [None]:
grader.check("q31")

Working with functions as values can lead to some funny-looking code. For example, see if you can figure out why the following code works. Check your explanation with a neighbor or a staff member.

In [47]:
make_array(max, np.average, are.equal_to).item(0)(4, -2, 7)

A more useful example of passing functions to other functions as arguments is the table method `apply`.

`apply` calls a function many times, once on *each* element in a column of a table.  It produces an *array* of the results.  Here we use `apply` to convert every CEO's total compensation to a number, using the function you defined:

In [48]:
salaries.apply(convert_pay_string_to_number, "Total Compensation 2019")

Here's an illustration of what that did:

<img src="apply.png"/>

Note that we didn't write `salaries.apply(convert_pay_string_to_number(), "Total Compensation 2019")` or `salaries.apply(convert_pay_string_to_number("Total Compensation 2019"))`. We just passed the name of the function, with no parentheses, to `apply`, because all we want to do is let `apply` know the name of the function we'd like to use and the name of the column we'd like to use it on. `apply` will then call the function `convert_pay_string_to_number` on each value in the column for us! The result is an array of numbers with the dollar signs removed.
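
To build intuition for why the function is passed without parentheses, here is a rough sketch of the idea behind `apply`. It is not how the `datascience` library implements it; `apply_to_each` and `percent_string_to_number` are hypothetical names, and the sketch works on a plain array instead of a table column.

```python
import numpy as np

def apply_to_each(function, values):
    """Calls function once on each value and collects the results in an array.
    (Hypothetical stand-in for the idea behind the table method apply.)"""
    results = []
    for value in values:
        results.append(function(value))  # The parentheses appear here, not at the call site.
    return np.array(results)

def percent_string_to_number(percent_string):
    """Converts a string like '10%' to the number 10.0."""
    return float(percent_string.strip("%"))

percent_strings = np.array(["10%", "25%", "50%"])
numbers = apply_to_each(percent_string_to_number, percent_strings)
```

The function is handed over as a value, and the helper decides when to call it. Writing `percent_string_to_number()` at the call site would try to call it immediately, with no argument.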

<span style="color:blue">**Question 3.0.2**</span> Using `apply`, make a table that's a copy of `salaries` with one additional column called `Total Compensation 2019 ($)`.  That column should contain the result of applying `convert_pay_string_to_number` to the `Total Compensation 2019` column (as we did above).  Call the new table `compensation`.

<!--
BEGIN QUESTION
name: q32
-->

In [49]:
compensation = salaries.with_column(
    "Total Compensation 2019 ($)",
    ...
    ) 
compensation

In [None]:
grader.check("q32")

Now that we have all the pays as numbers, we can learn more about them through computation.

<span style="color:blue">**Question 3.0.3**</span> Compute the average total pay of the CEOs in the dataset.

<!--
BEGIN QUESTION
name: q33
-->

In [52]:
average_total_pay = ...
average_total_pay

In [None]:
grader.check("q33")

<span style="color:blue">**Question 3.0.4**</span> Companies pay executives in a variety of ways: in cash, by granting stock or other equity in the company, or with ancillary benefits (like private jets).  Here, let's simply compute the proportion of each CEO's total compensation that came from pension value.  (Your answer should be an array of numbers, one for each CEO in the dataset.)

*Hint:* When you answer this question, you'll notice that the proportions are quite small, and Python displays them in scientific notation. It means you're on the right track!


<!--
BEGIN QUESTION
name: q34
-->

In [54]:
cash_proportion = ...
cash_proportion

In [None]:
grader.check("q34")

And just for fun... what was the sum of the total compensation of these CEOs in 2019?


In [57]:
sum_pay_2019 = np.sum(compensation.column("Total Compensation 2019 ($)"))
sum_pay_2019

**Why is `apply` useful?**

For operations like arithmetic, or the functions in the NumPy library, you don't need to use `apply`, because they automatically work on each element of an array.  But there are many things that don't.  The string manipulation we did in today's lab is one example.  Since you can write any code you want in a function, `apply` gives you total control over how you operate on data.
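
The contrast can be sketched with two small arrays (made up for this example): arithmetic works on a whole array at once, while a string method such as `strip` does not exist on arrays.

```python
import numpy as np

amounts = np.array([1, 2, 3])
doubled = amounts * 2  # Arithmetic is carried out on every element.

labels = np.array(["$1", "$2", "$3"])
try:
    labels.strip("$")  # Arrays have no strip method, so this raises an error.
    strip_worked = True
except AttributeError:
    strip_worked = False
```

This is the situation where writing a small function and handing it to `apply` pays off.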

# 4. Histograms
Earlier, we computed the average total compensation among the CEOs in our 99-CEO dataset.  The average doesn't tell us everything about the amounts CEOs are paid, though.  Maybe just a few CEOs make the bulk of the money, even among these 99.

We can use a *histogram* method to display the *distribution* of a set of numbers.  The table method `hist` takes a single argument, the name of a column of numbers.  It produces a histogram of the numbers in that column.
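
To make "distribution" concrete, the sketch below counts how many numbers fall into each bin using NumPy's `np.histogram`. This only illustrates the counting idea; the numbers are made up, and the bars that `hist` draws may be scaled differently (for example, as percentages rather than raw counts).

```python
import numpy as np

values = np.array([1, 3, 12, 15, 18, 35])
bin_edges = np.arange(0, 50, 10)  # Bins: [0, 10), [10, 20), [20, 30), [30, 40]

# counts has one entry per bin; edges is the array of bin boundaries.
counts, edges = np.histogram(values, bins=bin_edges)
```

Here two values land in the first bin, three in the second, none in the third, and one in the last.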

<span style="color:blue">**Question 4.0.1**</span> Make a histogram of the total pay of the CEOs in `compensation`. 

<!--
BEGIN QUESTION
name: q41
-->

In [58]:
compensation.labels

In [59]:
...

<span style="color:blue">**Question 4.0.2**</span> How many CEOs made more than $20 million in total pay? Find the value using code, then check that the value you found is consistent with what you see in the histogram.

*Hint:* Use the table method `where` and the property `num_rows`.

<!--
BEGIN QUESTION
name: q42
-->

In [60]:
num_ceos_more_than_30_million_2 = ...
num_ceos_more_than_30_million_2

In [None]:
grader.check("q42")

# 5. Burrito-ful San Diego

Tam, Margaret and Winifred are trying to use Data Science to find the best burritos in San Diego! Their friends Irene and Maya provided them with two comprehensive datasets on many burrito establishments in the San Diego area taken from (and cleaned from): https://www.kaggle.com/srcole/burritos-in-san-diego/data

The following cell reads in a table called `ratings` which contains names of burrito restaurants, their Yelp rating, Google rating, as well as their Overall rating. It also reads in a table called `burritos_types` which contains names of burrito restaurants, their menu items, and the cost of the respective menu item at the restaurant.

In [62]:
#Just run this cell
ratings = Table.read_table("ratings.csv")
ratings.show(5)
burritos_types = Table.read_table("burritos_types.csv")
burritos_types.show(5)

<span style="color:blue">**Question 5.0.1**</span> It would be easier if we could combine the information in both tables. Assign `burritos` to the result of joining the two tables together.

*Note: it doesn't matter which table you put in as the argument to the table method, either order will work for the autograder tests.*

*Hint: If you need refreshers on table methods, look at the [python reference](http://data8.org/sp20/python-reference.html).*

<!--
BEGIN QUESTION
name: q5_1
-->

In [63]:
burritos = ...
burritos.show(5)

In [None]:
grader.check("q5_1")

<!-- BEGIN QUESTION -->

<span style="color:blue">**Question 5.0.2**</span> Let's look at how the Yelp scores compare to the Google scores in the `burritos` table. First, assign `yelp_and_google` to a table only containing the columns `Yelp` and `Google`. Then, make a scatter plot with Yelp scores on the x-axis and the Google scores on the y-axis. 

<!--
BEGIN QUESTION
name: q5_2
manual: True
-->

In [67]:
yelp_and_google = ...
...
# Don't change/edit/remove the following line.
# To help you make conclusions, we have plotted a straight line on the graph (y=x)
plt.plot(np.arange(2.5,5,.5), np.arange(2.5,5,.5));

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<span style="color:blue">**Question 5.0.3**</span> Looking at the scatter plot you just made in Question 5.0.2, do you notice any pattern(s) (i.e. is one of the two types of scores consistently higher than the other one)? If so, describe them **briefly** in the cell below.

<!--
BEGIN QUESTION
name: q5_3
manual: True
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



Here's a refresher on how `.group` works! You can read how `.group` works in the [textbook](https://www.inferentialthinking.com/chapters/08/2/Classifying_by_One_Variable.html), or you can view the video below. The video resource was made by a past staff member - Divyesh Chotai!

In [68]:
from IPython.display import YouTubeVideo
YouTubeVideo("HLoYTCUP0fc")

<span style="color:blue">**Question 5.0.4**</span> From the `burritos` table, some of the restaurant locations have multiple reviews. Winifred thinks California burritos are the best type of burritos, and wants to see the average overall rating for California burritos at each location. Create a table that has two columns: the name of the restaurant and the average overall rating of California burritos at each location.

*Tip: Revisit the burritos table to see how California burritos are represented.*

*Note: you can break up the solution into multiple lines, as long as you assign the final output table to `california_burritos`! For reference however, the staff solution only used one line.*

<!--
BEGIN QUESTION
name: q5_4
-->

In [69]:
california_burritos = ...
california_burritos

In [None]:
grader.check("q5_4")

<span style="color:blue">**Question 5.0.5**</span> Given this new table `california_burritos`, Winifred can figure out the name of the restaurant with the highest overall average rating! Assign `best_restaurant` to a line of code that evaluates to a string that corresponds to the name of the restaurant with the highest overall average rating. 

<!--
BEGIN QUESTION
name: q5_5
-->

In [73]:
best_restaurant = ...
best_restaurant

In [None]:
grader.check("q5_5")

<span style="color:blue">**Question 5.0.6**</span> Using the `burritos` table, assign `menu_average` to a table that has three columns that uniquely pairs the name of the restaurant, the menu item featured in the review, and the average Overall score for that menu item at that restaurant.

*Hint: Use .group, and remember that you can group by multiple columns. Here's an example from the [textbook](https://www.inferentialthinking.com/chapters/08/3/Cross-Classifying_by_More_than_One_Variable.html)*.

<!--
BEGIN QUESTION
name: q5_6
-->

In [76]:
menu_average = ...
menu_average

In [None]:
grader.check("q5_6")

<!-- BEGIN QUESTION -->

<span style="color:blue">**Question 5.0.7.**</span> Plot a histogram that visualizes the distribution of the costs of the burritos from the `burritos` table. Also use the provided `bins` variable when making your histogram, so that visually the histogram is more informative.

<!--
BEGIN QUESTION
name: q5_7
manual: True
-->

In [79]:
bins = np.arange(0, 15, 1)
# Please also use the provided bins
...

<!-- END QUESTION -->



# 6. Government salaries


This exercise is designed to give you practice using the Table methods `pivot` and `group`. [Here](http://data8.org/sp20/python-reference.html) is a link to the Python reference page in case you need a quick refresher.

Run the cell below to view a demo on how you can use pivot on a table. (Thank you to past staff Divyesh Chotai)

In [80]:
from IPython.display import YouTubeVideo
YouTubeVideo("4WzXo8eKLAg")

In the next cell, we load a dataset published by the Government of Alberta, which contains salary and severance data for Government of Alberta employees, including their ministries, position titles, base salaries, severances (if applicable), and more. Let's only concern ourselves with their ministries, position titles, and base salaries.

In [81]:
positions = Table.read_table("alberta-salary-disclosure.csv").where("Year", are.equal_to(2020)).drop("PositionClass", "CashBenefits", "NonCashBenefits", "Severance")
positions = positions.relabeled("PositionTitle", "Position Title")
positions = positions.relabeled("BaseSalary", "Base Salary")
positions

We want to use this table to generate arrays with the position titles in each ministry.

<span style="color:blue">**Question 6.0.1.**</span> Set `organized_positions` to a table with two columns. The first column should be called `Ministry` and have the name of every ministry once, and the second column should be called `Positions` with each row in that second column containing an *array* of the names of all position titles in that ministry. 

*Hint:* Think about how ```group``` works: it collects values into an array and then applies a function to that array. We have defined two functions below for you, and you will need to use one of them in your call to ```group```.
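
The "collect, then apply" idea can be sketched in plain Python, independent of the `datascience` library. `group_sketch` is a hypothetical helper written only for this illustration, the ice cream data is made up, and the result is a dictionary rather than a table.

```python
import numpy as np

def group_sketch(keys, values, collect):
    """Gathers the values for each distinct key into an array, then calls
    collect on that array. (Hypothetical illustration, not the library's code.)"""
    gathered = {}
    for key, value in zip(keys, values):
        gathered.setdefault(key, []).append(value)
    return {key: collect(np.array(vals)) for key, vals in gathered.items()}

flavors = ["mint", "vanilla", "mint", "vanilla", "mint"]
prices = [3.0, 4.0, 5.0, 6.0, 4.0]

# np.average receives one array per flavor and reduces it to a single number.
average_price = group_sketch(flavors, prices, np.average)
```

Whatever function you pass in receives the whole array for a group, so what ends up in each row depends entirely on what that function returns.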

<!--
BEGIN QUESTION
name: q6_1
-->

In [82]:
# Pick one of the two functions defined below in your call to group.
def identity(array):
    '''Returns the array that is passed through'''
    return array 

def first(array):
    '''Returns the first item'''
    return array.item(0)

# Make a call to group using one of the functions above when you define organized_positions
organized_positions = ...
organized_positions

In [None]:
grader.check("q6_1")

#### Understanding the code you just wrote in 6.0.1 is important for moving forward with the class! If you made a lucky guess, take some time to look at the code, step by step.

<!-- BEGIN QUESTION -->

<span style="color:blue"> **Question 6.0.2.** </span>  In the original `positions` table, the `Base Salary` column isn't sorted. Would the arrays you generated in the `Positions` column of the previous part be the same if we had sorted by base salary before generating them? Two arrays are the **same** if they contain the same number of elements and the elements located at corresponding indexes in the two arrays are identical. An example of arrays that are NOT the same: `array([1,2]) != array([2,1])`. Explain your answer.  

<!--
BEGIN QUESTION
name: q6_2
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<span style="color:blue">**Question 6.0.3**</span> Set `ministry_ranges` to a pivot table containing year as the row, and the position as the columns. The values in the row should correspond to a salary range, where range is defined as the **difference between the highest salary and the lowest salary in the ministry for any position**. 

*Hint:* First you'll need to define a new function `salary_range` which takes in an array of salaries and returns the range of salaries in that array. 

<!--
BEGIN QUESTION
name: q6_3
manual: false
-->

In [90]:
# Define salary_range first
def salary_range(salaries): 
    ...

ministry_ranges = ...
ministry_ranges

In [None]:
grader.check("q6_3")

Great job! You're finished with Lab 5! Be sure to...

* **run all the tests** (the next cell has a shortcut for that),
* **Save and Checkpoint** from the File menu,
* **run the last cell to submit your work**,
* and **ask one of the staff members to check you off**.

This lab is altered from the original [Berkeley data-8 course](http://data8.org/), which is licensed under the [Creative Commons license](https://creativecommons.org/licenses/by-nc/4.0/).

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()