In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw04.ipynb")

<img style="display: block; margin-left: auto; margin-right: auto" src="./ccsf-logo.png" width="250rem;" alt="The CCSF black and white logo">

<div style="text-align: center;">
    <h1>Homework 4: Functions, Histograms, and Groups</h1>
    <em>View the related <a href="https://ccsf.instructure.com" target="_blank">Canvas</a> Assignment page for additional details.</em>
</div>

**Reading**: 

* [Visualizing Numerical Distributions](https://inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html) 
* [Functions and Tables](https://inferentialthinking.com/chapters/08/Functions_and_Tables.html)

Please complete this notebook by filling in the cells provided. Before you begin, execute the cell at the top of this notebook to load the provided tests. Each time you start your server, you will need to execute that cell again to load the tests.


**Throughout this homework and all future ones, please be sure not to reassign variables!** For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Moreover, please put your written answers only in the provided cells.

Run the following cell to import the relevant modules and settings.

In [None]:
import numpy as np
from datascience import *
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

import warnings
warnings.simplefilter('ignore', FutureWarning)
warnings.filterwarnings("ignore")

## San Francisco City and County Employee Salaries

In this homework assignment, you will focus on [Employee Compensation](https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd) data provided by the SF Controller's Office. We have filtered it to retain just the relevant columns and restricted the data to the year 2022. Run the following cell to load our data into a table called `sf`.

In [None]:
sf = Table.read_table('sf_2022.csv')
sf

* Each line represents an employee's job information such as job family and salary. 
* The SF Controller's Office provides [a PDF explaining what each variable means](https://data.sfgov.org/api/views/88g8-5mnd/files/OMBVvreoXRjXG6oP4Ts4497dNxt14XlBqB6uIL6cq-o?download=True&filename=N:\EIS\DataCoordination\Metadata%20Spring%20Cleaning\CON_DataDictionary_Employee-Compensation.pdf).
* You can _approximately_ think of every row as representing an employee. _Since a few employees transfer jobs throughout the year or hold multiple jobs, it is not exactly true that each row represents an employee._
* There are some jobs associated with a negative total compensation.

---

**The `sf` table contains just over 40,000 rows of information, so it takes up a lot of the memory you have available. We highly recommend that you keep only this notebook's kernel running while working on this assignment. You can review [Jupyter Lab's documentation on managing your running kernels](https://jupyterlab.readthedocs.io/en/stable/user/running.html) or ask someone in the class for help.**

---

### Task 01 📍🔎

<!-- BEGIN QUESTION -->

The departments do not all have the same number of jobs. 

1. Create a visualization that shows the distribution of departments represented in this data set in terms of the number of employees within each department.
2. Make sure the visualization highlights the departments with the largest number of employees represented in the data set.
3. Provide the name of the department with the largest number of employees.


_Points:_ 3

_Type your answer here, replacing this text._

In [None]:
...

# Leave this to provide a title to your visualization
plots.title('Departments (Employee Count)')
plots.show()

<!-- END QUESTION -->

### Task 02 📍

Some employees for the city and county make a lot of money. Identify who makes the most in terms of total compensation (the sum of salary and benefits).
1. Assign the largest total compensation (a `float` value) to `max_compensation`.
2. Assign the job title (a `str` value) associated with the highest total compensation to the name `max_compensation_position`.

_Points:_ 4

In [None]:
max_compensation = ...
max_compensation_position = ...
max_compensation, max_compensation_position

In [None]:
grader.check("task_02")

### Task 03 📍🔎

<!-- BEGIN QUESTION -->

1. Create a histogram showing the distribution of total compensation. 
2. Use the `unit = '$'` argument and use the default bins.
3. Explain why there seems to be no information visualized on the right side of the histogram.


_Points:_ 4

_Type your answer here, replacing this text._

In [None]:
...

# Leave this to provide a title to your visualization
plots.title('Total Compensation')
plots.show()

<!-- END QUESTION -->

### Task 04 📍🔎

<!-- BEGIN QUESTION -->

The area of a bar in a histogram reflects the percentage of the data represented in that particular bar. To have you think more deeply about this, we want you to create a strange version of the histogram you just made.

1. Create an array called `equal_split_bins` that can be used to form the bins for a histogram of total compensation that splits the data into two bins that each hold half of the data.
    * _Hint: The median of a collection of numbers can be used to split the numbers in two equal halves._
2. Create a histogram for the total compensation values by using `equal_split_bins` such that your histogram only shows 2 bins (bars).
3. The heights of the two bars should not be the same. Explain why the visualization nevertheless shows that the data is split equally between the two bins.

_Points:_ 4

_Type your answer here, replacing this text._

In [None]:
# Define the bins
equal_split_bins = ...
# Create the histogram
...

# Leave this to provide a title to your visualization
plots.title('Total Compensation')
plots.show()

<!-- END QUESTION -->

The visualization of numerical data formed using a histogram can vary wildly depending on how you bin the data, but both of the above visualizations represent the same data!
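The relationship between bar area and percentage can be checked by hand. The following is a minimal sketch in plain Python, not the `datascience` library's own code: the `values` list, the bin edges, and the helper `bar_heights` are all hypothetical names made up for illustration. It assumes the common convention that each bin includes its left edge and that only the last bin includes its right edge.

```python
# Hypothetical toy data (not taken from the sf table)
values = [1, 2, 3, 4, 6, 8, 10, 20, 30, 40]
bin_edges = [0, 5, 10, 50]  # three bins with unequal widths

def bar_heights(data, edges):
    """Return a (percent, height) pair for each bin.

    height is measured in percent per unit, so height * width == percent.
    """
    results = []
    last = len(edges) - 2
    for i in range(len(edges) - 1):
        left, right = edges[i], edges[i + 1]
        # Assumed convention: [left, right), except the last bin is [left, right]
        count = sum(1 for v in data
                    if left <= v < right or (i == last and v == right))
        percent = 100 * count / len(data)
        results.append((percent, percent / (right - left)))
    return results

heights = bar_heights(values, bin_edges)
for (percent, height), left, right in zip(heights, bin_edges, bin_edges[1:]):
    print(f"[{left}, {right}): {percent:.0f}% of the data, "
          f"height {height:.1f} percent per unit")
```

In this toy data, the first and last bins each hold 40% of the values, yet the first bar has height 8 and the last has height 1, because the last bin is eight times wider. Multiplying each height by its bin width recovers the percentage.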

Continuing, you will be exploring the data set and demonstrating some more of the things you've learned about so far in the class!

### Task 05 📍

California has laws in place to help govern how much an employee should be paid for overtime work. There could be several reasons why an employee works overtime. Some are healthy reasons and some are not. The ratio of overtime compensation to total compensation can provide a signal of the health of departments and their employees. According to [Indeed.com](https://www.indeed.com/career-advice/career-development/working-overtime), here are some potential disadvantages of working extra hours:

> Focus loss:
>
> You will likely want to take breaks while working overtime, and you may lose focus and productivity naturally as your working hours increase. 
> 
> Safety and health risk:
>
> Working longer hours also can be dangerous, depending on the job. Working overtime regularly can also disrupt your work-life balance, lead to burnout or create health risks, such as sitting at a computer for long periods. Due to these risks, more companies are limiting the number of hours worked in certain positions, such as truck drivers. 
> 
> Less work-life balance:
>
> There are only 24 hours in the day, and working overtime reduces the time for a good work-life balance. More work hours mean fewer hours for family, relaxation and sleep. 

Let's see which employees are earning a lot of overtime relative to their total compensation.

1. Filter the data in `sf` to only include rows with a total compensation above $20,000. Name the resulting table `sf_above_20k`.
2. Create an array `overtime_ratios` that contains the ratio of overtime pay to total compensation pay for all the employees in the data set.
3. Using that array, create a table called `overtime_ratio_top_10` with a column called `Overtime Ratio` that shows the top 10 employees that have the highest overtime pay to total compensation pay ratio.

_Points:_ 6

In [None]:
sf_above_20k = ...
overtime_ratios = ...
overtime_ratio_top_10 = ...
overtime_ratio_top_10

In [None]:
grader.check("task_05")

### Task 06 📍

Set `job_titles` to a table with two columns. 
* The first column should be called "Organization Group".
* The first column should have the name of every "Organization Group" once
* The second column should be called "Jobs" with each row in that second column containing an *array* of the names of all the job titles within that "Organization Group". 
* Don't worry if the same job title appears more than once in an array.

Consider a few things while working on this:
* Think about how `group` works: it collects values into an array and then applies a function to that array. We have defined two functions below for you, and you will need to use one of them in your call to `group`. 
* It might be helpful to create intermediary tables and experiment with the given functions.
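The "collect, then apply" behavior described above can be sketched in plain Python. This is only an analogy under stated assumptions, not the `datascience` implementation: the `rows` data and the helper `group_collect` are hypothetical, and the built-ins `max` and `len` stand in for whatever function is passed to `group`.

```python
# Hypothetical toy rows of (label, value) pairs, not taken from the sf table
rows = [("red", 3), ("blue", 5), ("red", 8), ("blue", 1), ("green", 4)]

def group_collect(pairs, collect):
    """Gather the values for each label, then apply `collect` to each collection."""
    gathered = {}
    for label, value in pairs:
        # Values are appended in the order the rows appear
        gathered.setdefault(label, []).append(value)
    # Assumption for this sketch: labels are reported in sorted order
    return {label: collect(gathered[label]) for label in sorted(gathered)}

largest = group_collect(rows, max)  # one number per label
counts = group_collect(rows, len)   # how many rows each label has

print(largest)
print(counts)
```

Whatever `collect` returns becomes the single entry for that label, so the choice of function determines whether each entry is one value or a whole collection.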

_Points:_ 6

In [None]:
# Pick one of the two functions defined below in your call to group.
def first_item(array):
    '''Returns the first item'''
    return array.item(0)

def full_array(array):
    '''Returns the array that is passed through'''
    return array 

# Make a call to group using one of the functions above when you define job_titles
job_titles = ...
job_titles

In [None]:
grader.check("task_06")

---

Understanding the code you just wrote in the previous task is important for moving forward with the class! If you made a lucky guess, take some time to look at the code, step by step.

---

### Task 07 📍🔎

<!-- BEGIN QUESTION -->

At the moment, the `Job` column of the `sf` table is not sorted (it is in no particular order). Would the arrays you generated in the `Jobs` column of the previous question be the same if we had first sorted the `Job` column alphabetically before generating them? Explain your answer. To receive full credit, your answer should reference
1. *how* the `.group` method works
2. *how* sorting the `Job` column would affect this.

Keep in mind that two arrays are the **same** if they contain the same number of elements and the elements located at corresponding indexes in the two arrays are identical. An example of arrays that are NOT the same: `array([1,2]) != array([2,1])`.
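As a small aside, here is a hedged sketch of how this definition of "same" can be checked with NumPy. The variable names are made up for illustration. Note that `!=` on two arrays compares them position by position and returns an array of booleans, while `np.array_equal` gives a single answer for the whole arrays.

```python
import numpy as np

a = np.array([1, 2])
b = np.array([2, 1])

# Position-by-position comparison: one boolean per index
elementwise = a != b

# Whole-array comparison: a single True or False
same_arrays = np.array_equal(a, b)
same_as_copy = np.array_equal(a, np.array([1, 2]))

print(elementwise)
print(same_arrays, same_as_copy)
```

Here `a` and `b` contain the same elements, but in a different order, so they are not the same array under the definition above.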


_Points:_ 2

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Task 08 📍

Set `department_ranges` to a table containing departments as the rows, and the organization groups as the columns. The values in the rows should correspond to a total compensation range, where range is defined as the **difference between the highest total compensation and the lowest total compensation in the department for that organization group**.

Keep in mind the following while working on this:

* First you'll need to define a new function `compensation_range` which takes in an array of compensations and returns the range of compensations in that array.
* What table function allows you to specify the rows and columns of a new table?


_Points:_ 4

In [None]:
# Define compensation_range first
...
    ...

department_ranges = ...
department_ranges

In [None]:
grader.check("task_08")

### Task 09 📍🔎

<!-- BEGIN QUESTION -->

Provide at least **two** different explanations as to why some of the row values are `0` in the `department_ranges` table from the previous question.


_Points:_ 2

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Task 10 📍

Find the number of departments appearing in the `sf` table that have an average total compensation of greater than 125,000 dollars; assign this value to the variable `num_over_125k`.

The variable names we've provided below are meant to help guide the intermediate steps and general thought process. Feel free to delete them if you'd prefer to start from scratch, but make sure your final answer is assigned to `num_over_125k`!

_Points:_ 2

In [None]:
depts_and_comp = ...
avg_of_depts = ...
over_125k = ...
num_over_125k = ...
num_over_125k

In [None]:
grader.check("task_10")

There are many things you can explore in this data set, but that is enough for now!

## Submit your Homework to Canvas

Once you have finished working on the homework tasks, prepare to submit your work in Canvas by completing the following steps.

1. In the related Canvas Assignment page, check the rubric to know how you will be scored for this assignment.
2. Double-check that you have run the code cell near the end of the notebook that contains the command `grader.check_all()`. This command runs all of the tests on your responses to the auto-graded tasks marked with 📍.
3. Double-check your responses to the manually graded tasks marked with 📍🔎.
4. Select the menu items "File" and "Save Notebook" in the notebook's Toolbar to save your work and create a specific checkpoint in the notebook's work history.
5. Select the menu items "File" and "Download" in the notebook's Toolbar to download the notebook (.ipynb) file.
6. In the related Canvas Assignment page, click Start Assignment or New Attempt to upload the downloaded .ipynb file.

**Keep in mind that the autograder does not always check for correctness. Sometimes it just checks for the format of your answer, so passing the autograder for a question does not mean you got the answer correct for that question.**

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()