In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab04.ipynb")

<img src="./ccsf.png" alt="CCSF Logo" width=200px style="margin:0px -5px">

# Lab 04: Visualizations and Functions

## References

* [Sections 7.0 - 7.3](https://inferentialthinking.com/chapters/07/Visualization.html)
* [Sections 8.0 - 8.1](https://inferentialthinking.com/chapters/08/Functions_and_Tables.html)
* [datascience Documentation](https://datascience.readthedocs.io/)
* [Markdown Cheat Sheet](https://www.markdownguide.org/cheat-sheet/)

---

## Lab Assignment Reminders

- 🚨 Make sure to run the code cell at the top of this notebook that starts with `# Initialize Otter` to load the auto-grader.
- Your tasks are categorized as auto-graded (📍) and manually graded (📍🔎):
    - **For all auto-graded tasks:**
        - Replace the `...` in the provided code cell with your own code.
        - Run the `grader.check` code cell to execute tests on your code.
        - There are no hidden auto-grader tests in the lab assignments. This means if you pass the tests, you can assume you've completed the task successfully.
    - **For all manually graded tasks:**
        - You may need to provide your own response to the provided prompt. Replace the template text "_Type your answer here, replacing this text._" with your own words.
        - You might need to produce a graphic or another output using code. Replace the `...` in the code cell to generate the image, table, etc.
        - In either case, check your response with a classmate, a tutor, or the instructor before moving on.
- Throughout this assignment and all future ones, please **do not re-assign variables** throughout the notebook! _For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you may fail tests that you thought you were passing previously!_
- You may [submit](#Submit-Your-Assignment-to-Canvas) this assignment as many times as you want before the deadline. Your instructor will score the last version you submit once the deadline has passed.
- **Collaborating on labs is encouraged!** You should rarely remain stuck for more than a few minutes on questions in labs, so ask an instructor or classmate for help. (Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it.) However, please don't just share answers.

---

## Configure the Notebook

Run the following cell to configure this Notebook.

In [None]:
import numpy as np
from datascience import *
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Interactive Widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

---

## Defining functions

Start this lab by learning how to create a very simple function that converts a proportion to a percentage by multiplying it by 100.  For example, the value of `to_percentage(.5)` should be the number 50 (no percent sign).

A function definition has a few parts.

---

### `def`

It always starts with `def` (short for **def**ine):

``` python
def
```

---

### Name

Next comes the name of the function.  Like other names we've defined, it can't start with a number or contain spaces. Let's call our function `to_percentage`:

``` python
def to_percentage
```

---

### Signature

Next comes something called the *signature* of the function.  This tells Python how many arguments your function should have, and what names you'll use to refer to those arguments in the function's code.  A function can have any number of arguments (including 0!). 

`to_percentage` should take one argument, and we'll call that argument `proportion` since it should be a proportion.

``` python
def to_percentage(proportion)
```

If we want our function to take more than one argument, we add a comma between each argument name. Note that if we had zero arguments, we'd still place the parentheses `()` after the name.

We put a colon after the signature to tell Python it's over. If you're getting a syntax error after defining a function, check to make sure you remembered the colon!

``` python
def to_percentage(proportion):
```
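
To make the comma and empty-parentheses rules concrete, here is a minimal sketch. The function names `say_hello` and `rectangle_area` are made up for illustration and are not part of this lab.

```python
def say_hello():
    """Takes zero arguments; the parentheses are still required."""
    return "Hello!"

def rectangle_area(width, height):
    """Takes two arguments, separated by a comma."""
    return width * height

print(say_hello())           # Hello!
print(rectangle_area(3, 4))  # 12
```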

---

### Documentation

Functions can do complicated things, so you should write an explanation of what your function does.  For small functions, this is less important, but it's a good habit to learn from the start.  Conventionally, Python functions are documented by writing an **indented** triple-quoted string:

``` python
def to_percentage(proportion):
    """Converts a proportion to a percentage."""
```
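
As a small illustration (the function `double` is a made-up example), Python stores the docstring on the function itself. Lines that start with `>>>` inside a docstring show an example call, followed by the value that call should produce. The standard-library `doctest` module can check such examples.

```python
import doctest

def double(x):
    """Returns twice the value of x.

    >>> double(4)
    8
    """
    return 2 * x

# The docstring is stored on the function itself.
first_line = double.__doc__.splitlines()[0]
print(first_line)  # Returns twice the value of x.

# Run the >>> example; nothing is printed when the example passes.
doctest.run_docstring_examples(double, {"double": double})
```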

---

### Body

Now we start writing code that runs when the function is called.  This is called the *body* of the function and every line **must be indented with a tab or 4 spaces**.  Any line that is *not* indented, and is instead left-aligned with the `def` statement, is considered outside the function.

Some notes about the body of the function:
- We can write code that we would write anywhere else.  
- We use the arguments defined in the function signature. We can do this because we assume that when we call the function, values are already assigned to those arguments.
- We generally avoid referencing variables defined *outside* the function. If you would like to reference variables outside of the function, pass them through as arguments!
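
The last note can be sketched as follows. The names `shipping_fee`, `total_using_outside_name`, and `total` are hypothetical and only serve to contrast the two styles.

```python
shipping_fee = 5  # defined outside any function

def total_using_outside_name(price):
    # Works, but silently depends on whatever shipping_fee happens to be.
    return price + shipping_fee

def total(price, fee):
    # Everything the function needs arrives through its arguments.
    return price + fee

print(total_using_outside_name(20))  # 25
print(total(20, shipping_fee))       # 25
```

Both calls give the same result here, but only `total` keeps working as expected if `shipping_fee` is later changed or removed.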


Now, let's give a name to the number we multiply a proportion by to get a percentage:

``` python
def to_percentage(proportion):
    """Converts a proportion to a percentage."""
    factor = 100
```

---

### `return`

The special instruction `return` is part of the function's body and tells Python to make the value of the function call equal to whatever comes right after `return`.  We want the value of `to_percentage(.5)` to be the proportion .5 times the factor 100, so we write:

``` python
def to_percentage(proportion):
    """Converts a proportion to a percentage."""
    factor = 100
    return proportion * factor
```
        
`return` only makes sense in the context of a function, and **can never be used outside of a function**. Python stops executing the body of a function once it hits a `return` statement, so in the functions you write in this lab, `return` is the last line of the body.

*Note:*  `return` inside a function tells Python what value the function evaluates to. However, there are other functions, like `print`, that have no `return` value. For example, `print` simply prints a certain value out to the console. 

`return` and `print` are **very** different. 
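
A minimal sketch of the difference, using two made-up functions (`returns_double` and `prints_double` are not part of this lab):

```python
def returns_double(x):
    return 2 * x
    print("never reached")  # Python has already left the function

def prints_double(x):
    print(2 * x)  # displays the value but does not return it

a = returns_double(5)   # nothing is displayed; a is 10
b = prints_double(5)    # displays 10; b is None

print(a)  # 10
print(b)  # None
```

Only the returned value can be given a name and used in later computations.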

---

### Task 01 📍

1. Define `to_percentage` in the cell below.
2. Call your function to convert the proportion 0.2 to a percentage and name the output `twenty_percent`.


In [None]:
def ...
    """
    Converts a proportion to a percentage.
    
    >>> to_percentage(0.25)
    25.0
    """
    return ...
    
twenty_percent = ...
twenty_percent


In [None]:
grader.check("task_01")

Like you’ve done with built-in functions in previous labs (`max`, `abs`, etc.), you can pass in named values as arguments to your function.

---

### Task 02 📍

Use `to_percentage` again to convert the proportion named `a_proportion` (defined below) to a percentage called `a_percentage`.

*Note:* You don't need to define `to_percentage` again!  Like other named values, functions stick around after you define them.


In [None]:
a_proportion = 2 ** (0.5) / 2
a_percentage = ...
a_percentage

In [None]:
grader.check("task_02")

Here's something important about functions: the names assigned *within* a function body are only accessible within the function body. Once the function has returned, those names are gone.  So even if you created a variable called `factor` and defined `factor = 100` inside of the body of the `to_percentage` function and then called `to_percentage`, `factor` would not have a value assigned to it outside of the body of `to_percentage`.
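
Here is a small sketch of that behavior with a made-up function, `hours_to_days`. The name `hours_per_day` is assigned inside the body, so asking for it outside the function raises a `NameError`.

```python
def hours_to_days(hours):
    hours_per_day = 24  # this name exists only inside the function
    return hours / hours_per_day

print(hours_to_days(36))  # 1.5

try:
    hours_per_day  # not defined out here
except NameError as error:
    message = str(error)
    print("NameError:", message)
```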

As we've seen with built-in functions, functions can also take strings (or arrays, or tables) as arguments, and they can return those things, too.

---

### Task 03 📍

Define a function called `disemvowel`.  It should take a single string as its argument.  (You can call that argument whatever you want.)  It should return a copy of that string, but with all the characters that are vowels removed.  (In English, the vowels are the characters "a", "e", "i", "o", and "u".) You can use as many lines inside of the function to do this as you’d like.

*Hint:* To remove all the "a"s from a string, you can use `that_string.replace("a", "")`.  The `.replace` method for strings returns a new string, so you can call `replace` multiple times, one after the other. 
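
As an illustration of chaining `replace` on a different problem (removing punctuation rather than vowels), consider this sketch. The variable names are made up for the example.

```python
sentence = "Hello, world!"

no_comma = sentence.replace(",", "")
no_punctuation = sentence.replace(",", "").replace("!", "")

print(no_comma)        # Hello world!
print(no_punctuation)  # Hello world
print(sentence)        # Hello, world!  (the original string is unchanged)
```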


In [None]:
def disemvowel(a_string):
    """
    Removes all vowels from a string.
    
    >>> disemvowel('datascience')
    'dtscnc'
    """ 
    ...

# Checking to see if your function works for an example string.
disemvowel("Can you read this without vowels?")

In [None]:
grader.check("task_03")

---

### Calls on calls on calls

Just as you write a series of lines to build up a complex computation, it's useful to define a series of small functions that build on each other.  Since you can write any code inside a function's body, you can call other functions you've written.

If a function is like a recipe, defining a function in terms of other functions is like having a recipe for cake telling you to follow another recipe to make the frosting, and another to make the jam filling.  This makes the cake recipe shorter and clearer, and it avoids having a bunch of duplicated frosting recipes.  It's a foundation of productive programming.

For example, suppose you want to count the number of characters *that aren't vowels* in a piece of text.  One way to do that is to remove all the vowels and count the size of the remaining string.
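
The same idea of building one function out of another can be sketched with numbers. The functions `square` and `sum_of_squares` are made up for illustration.

```python
def square(x):
    """Returns x multiplied by itself."""
    return x * x

def sum_of_squares(a, b):
    """Reuses square instead of repeating the multiplication."""
    return square(a) + square(b)

print(sum_of_squares(3, 4))  # 25
```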

---

### Task 04 📍

Write a function called `num_non_vowels`.  It should take a string as its argument and return a number.  That number should be the number of characters in the argument string that aren't vowels. You should use the `disemvowel` function you wrote above inside of the `num_non_vowels` function.

*Hint:* The function `len` takes a string as its argument and returns the number of characters in it.


In [None]:
def num_non_vowels(a_string):
    """
    The number of characters in a string, minus the vowels.
    
    >>> num_non_vowels('datascience')
    6
    """
    ...

# Try running an example function call of num_non_vowels below to test out your code.
...

In [None]:
grader.check("task_04")

Functions can also encapsulate code that *displays output* instead of computing a value. For example, if you call `print` inside a function, and then call that function, something will get printed.

---

## Functions and CEO Incomes

In this question, we'll look at the 2015 compensation of CEOs at the 100 largest companies in California. The data was compiled from a [Los Angeles Times analysis](http://spreadsheets.latimes.com/california-ceo-compensation/), and ultimately came from [filings](https://www.sec.gov/answers/proxyhtf.htm) mandated by the SEC from all publicly-traded companies. Two companies have two CEOs, so there are 102 CEOs in the dataset.

We've copied the raw data from the LA Times page into a file called `raw_compensation.csv`. (The page notes that all dollar amounts are in **millions of dollars**.)

In [None]:
raw_compensation = Table.read_table('raw_compensation.csv')
raw_compensation

Computing the average of the CEOs' pay might seem like a simple task, but there are some issues with this data that make that task a bit challenging. View the error produced by running `np.average(raw_compensation.column('Total Pay'))`:

<img src="./totalpay_error.png" alt="The UFuncTypeError that occurs from running np.average(raw_compensation.column('Total Pay'))">

Let's examine why this error occurred by looking at the values in the `Total Pay` column. 

---

### Task 05 📍

Use the `type` function and set `total_pay_type` to the type of the first value in the "Total Pay" column.


In [None]:
total_pay_type = ...
total_pay_type

In [None]:
grader.check("task_05")

---

### Task 06 📍

You should have found that the values in the `Total Pay` column are strings. It doesn't make sense to take the average of string values, so we need to convert them to numbers if we want to do this. Extract the first value in `Total Pay`.  It's Mark Hurd's pay in 2015, in *millions* of dollars.  Call it `mark_hurd_pay_string`.


In [None]:
mark_hurd_pay_string = ...
mark_hurd_pay_string

In [None]:
grader.check("task_06")

---

### Task 07 📍

Convert `mark_hurd_pay_string` to a number of *dollars*. 

Some hints, as this question requires multiple steps:
- The string method `strip` will be useful for removing the dollar sign; it removes a specified character from the start or end of a string.  For example, the value of `"100%".strip("%")` is the string `"100"`.  
- You'll also need the function `float`, which converts a string that looks like a number to an actual number.  
- Finally, remember that the answer should be in dollars, not millions of dollars.
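
Here is a sketch of `strip` and `float` working together on the percent example from the first hint. The variable names are made up, and the final line shows that `strip` only removes characters from the two ends of a string.

```python
percent_string = "85%"

number_text = percent_string.strip("%")
print(number_text)  # 85  (still a string)

proportion = float(number_text) / 100
print(proportion)   # 0.85

print("%5%0%".strip("%"))  # 5%0  (the middle % stays)
```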


In [None]:
mark_hurd_pay = ...
mark_hurd_pay

In [None]:
grader.check("task_07")

To compute the average pay, we need to do this for every CEO.  But that looks like it would involve copying this code 102 times.

This is where functions come in.  First, we'll define a new function, giving a name to the expression that converts "total pay" strings to numeric values.  Later in this lab, we'll see the payoff: we can call that function on every pay string in the dataset at once.

You learned how to define a function earlier in this lab, so for now, just fill in the ellipsis in the cell below.

---

### Task 08 📍

Copy the expression you used to compute `mark_hurd_pay`, and use it as the return expression of the function below. But make sure you replace the specific `mark_hurd_pay_string` with the generic `pay_string` name specified in the first line in the `def` statement.

*Hint*: When dealing with functions, you should generally not be referencing any variable outside of the function. Usually, you want to be working with the arguments that are passed into it, such as `pay_string` for this function. If you're using `mark_hurd_pay_string` within your function, you're referencing an outside variable!


In [None]:
def convert_pay_string_to_number(pay_string):
    """
    Converts a pay string like '$100' (in millions) to a number of
    dollars.
    
    >>> convert_pay_string_to_number("$100 ")
    100000000.0
    """
    ...

In [None]:
grader.check("task_08")

Running that cell doesn't convert any particular pay string. Instead, it creates a function called `convert_pay_string_to_number` that can convert *any* string with the right format to a number of dollars.

We can call our function just like we call the built-in functions we've seen. It takes one argument -- a string -- and it returns a float.

In [None]:
convert_pay_string_to_number('$42')

In [None]:
convert_pay_string_to_number(mark_hurd_pay_string)

We can also compute Safra Catz's pay in the same way. Run the following cell to see the results.

In [None]:
convert_pay_string_to_number(raw_compensation.where("Name", are.containing("Safra")).column("Total Pay").item(0))

So, what have we gained by defining the `convert_pay_string_to_number` function? 
Well, without it, we'd have to copy the code `10**6 * float(some_pay_string.strip("$"))` each time we wanted to convert a pay string.  Now we just call a function whose name says exactly what it's doing.

---

## `apply`ing functions

Defining a function is a lot like giving a name to a value with `=`.  In fact, a function is a value just like the number 1 or the text "data"!

For example, we can make a new name for the built-in function `max` if we want:

In [None]:
our_name_for_max = max
our_name_for_max(2, 6)

The old name for `max` is still around:

In [None]:
max(2, 6)

Try just writing `max` or `our_name_for_max` (or the name of any other function) in a cell, and run that cell.  Python will print out a (very brief) description of the function.

In [None]:
max

Now try writing `?max` or `?our_name_for_max` (or the name of any other function) in a cell, and run that cell.  An information box should show up at the bottom of your screen with a longer description of the function.

*Note: You can also press Shift+Tab after clicking on a name to see similar information!*

In [None]:
?our_name_for_max

Let's look at what happens when we set `max` to a non-function value. You'll notice that a `TypeError` will occur when you try calling `max`. Things like integers and strings are not callable. Look out for function names that might have been reassigned when you encounter this type of error.

<img src="./rename_function_error.png" alt="The TypeError that occurs when max is used to name a value and then used as a function.">

The following cell resets `max` to the built-in function. Run it in case you changed `max`.

In [None]:
#Just run this cell, don't change its contents
import builtins
max = builtins.max

Why is this useful?  Since functions are just values, it's possible to pass them as arguments to other functions.  Here's a simple but not-so-practical example: we can make an array of functions.

In [None]:
make_array(max, np.average, are.equal_to)

Working with functions as values can lead to some funny-looking code. For example, see if you can figure out why the following code works. Check your explanation with a neighbor or a staff member.

In [None]:
make_array(max, np.average, are.equal_to).item(0)(4, -2, 7)

A more useful example of passing functions to other functions as arguments is the table method `apply`.

`apply` calls a function many times, once on *each* element in a column of a table.  It produces an *array* of the results.  Here we use `apply` to convert every CEO's pay to a number, using the function you defined:

In [None]:
raw_compensation.apply(convert_pay_string_to_number, "Total Pay")

Here's an illustration of what that did:

<img src="./apply.png"/>

Note that we didn’t write `raw_compensation.apply(convert_pay_string_to_number(), "Total Pay")` or `raw_compensation.apply(convert_pay_string_to_number("Total Pay"))`. We just passed the name of the function, with no parentheses, to `apply`, because all we want to do is let `apply` know the name of the function we’d like to use and the name of the column we’d like to use it on. `apply` will then call the function `convert_pay_string_to_number` on each value in the column for us!
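
To see why the parentheses matter, here is a plain-Python sketch. The helper `apply_to_each` is a hypothetical stand-in written for this example; it is not how the `datascience` library implements `apply`, and it returns a list rather than an array.

```python
def apply_to_each(fn, values):
    """Hypothetical helper: calls fn once on each value."""
    return [fn(value) for value in values]

def add_exclamation(text):
    return text + "!"

# Pass the function itself (no parentheses); the helper does the calling.
results = apply_to_each(add_exclamation, ["hi", "bye"])
print(results)  # ['hi!', 'bye!']

# With parentheses, Python tries to call add_exclamation right away,
# with no argument, before apply_to_each ever runs.
try:
    apply_to_each(add_exclamation(), ["hi", "bye"])
except TypeError:
    called_too_early = True
    print("TypeError: add_exclamation() was called with no argument")
```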

---

### `compensation` Table

### Task 09 📍

Using `apply`, make a table that's a copy of `raw_compensation` with one additional column called `Total Pay ($)`.  That column should contain the result of applying `convert_pay_string_to_number` to the `Total Pay` column (as we did above).  Call the new table `compensation`.


In [None]:
compensation = raw_compensation.with_column(
    "Total Pay ($)",
    ...
    ) 
compensation

In [None]:
grader.check("task_09")

Now that we have all the pays as numbers, we can finally calculate the average from earlier.

---

### Task 10 📍

Compute the average total pay of the CEOs in the dataset.

In [None]:
average_total_pay = ...
average_total_pay

In [None]:
grader.check("task_10")

---

### Why is `apply` useful?

For operations like arithmetic, or the functions in the NumPy library, you don't need to use `apply`, because they automatically work on each element of an array.  But there are many things that don't.  The string manipulation we did in today's lab is one example.  Since you can write any code you want in a function, `apply` gives you total control over how you operate on data.
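
The contrast can be sketched with two small made-up arrays. Arithmetic works on every element of a NumPy array at once, but a string method such as `strip` is not available on the array as a whole, so it has to be called on each element.

```python
import numpy as np

# Arithmetic is applied to each element automatically.
pay_in_millions = np.array([1.5, 2.0, 0.25])
pay_in_dollars = pay_in_millions * 10**6
print(pay_in_dollars.tolist())  # [1500000.0, 2000000.0, 250000.0]

# String methods are not: an array has no strip method.
pay_strings = np.array(["$1.5", "$2.0", "$0.25"])
try:
    pay_strings.strip("$")
except AttributeError:
    has_strip = False

# Calling strip on each element, one at a time, does work.
cleaned = [s.strip("$") for s in pay_strings]
print(cleaned)
```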

---

## Visualizations and Attributes

Visualizing data is an essential step in gaining insight from large and complex datasets. Many techniques and tools exist for turning raw data into meaningful pictures, and a small set of standard visualizations has emerged as the go-to options, each with its own strengths. Which one to use depends on several factors, and a key one is the attribute type of the data under investigation. Note that the attribute type does not always match the data type in which the information is stored, so choosing an appropriate visualization takes some judgment.

To streamline our understanding of attribute types for data visualization, we can simplify them into two broad categories: **numerical** and **categorical**. 
* Numerical attributes encompass data that consists of continuous or discrete numeric values. These attributes are typically quantitative and can be operated on mathematically. Examples of numerical attributes include variables like age, temperature, or income.
* Categorical attributes deal with data that fall into distinct categories or labels. They represent qualitative information where mathematical operations typically do not have clear meanings. Examples of categorical attributes include gender (male, female, nonbinary), color (red, blue, green), or product categories (electronics, clothing, food).

By classifying attributes into these two fundamental types, we can better tailor our choice of visualization methods to the nature of the data, allowing us to extract more valuable insights from it.

---

### Numbers for Categories

Just because an attribute has values that are numbers does not mean you should treat the attribute as numerical. Postal codes are numbers. However, the attribute type for postal codes is categorical rather than numerical. Postal codes represent specific geographical regions and are not meant for mathematical operations like addition or subtraction. For example, `90210` (Beverly Hills) and `10001` (Manhattan) are categorical values representing different locations. Choosing an appropriate visualization method for postal codes would involve treating them categorically, not numerically, despite their data type.
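
A short sketch with a made-up list of postal codes shows the difference. Python will happily average the codes, but the result does not describe any place, while counting how often each code appears is meaningful.

```python
from collections import Counter

# Hypothetical sample of postal codes stored as integers.
postal_codes = [90210, 10001, 90210, 94112]

# Arithmetic runs without error, but the answer is not a location.
average_code = sum(postal_codes) / len(postal_codes)
print(average_code)  # 71133.25

# Counting occurrences treats the codes as categories.
code_counts = Counter(postal_codes)
print(code_counts[90210])  # 2
```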

---

### Task 11 📍

Which of the following attributes are categorical in nature? Assign `categorical_attributes` to an array with the numbers for the variables that represent categorical attributes.

1. Height in centimeters
2. Eye color (e.g., blue, brown, green)
3. Temperature in degrees Celsius
4. Years of education completed
5. Vehicle make and model (e.g., Toyota Camry)
6. Employee identification number
7. Blood type (e.g., A, B, AB, O)
8. Stock prices
9. Time of day (e.g., morning, afternoon, evening)
10. Mobile phone number

In [None]:
categorical_attributes = ...

In [None]:
grader.check("task_11")

---

## Bar Charts

When it comes to visualizing categorical data, one of the standard and effective methods is the use of bar charts. Bar charts allow us to represent categorical variables by displaying their frequencies or proportions as bars of different lengths or heights. 

In Python, you can create bar charts easily using libraries like `matplotlib` or `datascience`. Specifically, the `datascience` library provides the `bar` and `barh` table methods, which simplify the process of generating bar charts. The `bar` method is used for vertical bar charts, while the `barh` method is employed for horizontal bar charts.

Run the following code cell to create the table `car_inventory`.

In [None]:
car_inventory = Table().with_columns(
    'Car Type', ['Sedan', 'SUV', 'Truck', 'Hatchback'],
    'Count', [25, 15, 12, 8]
)
car_inventory

Using the `barh` table method, you can create a horizontal bar chart to visualize the distribution of car types in `car_inventory`.

Run the following code cell to generate the bar chart.

In [None]:
car_inventory.barh('Car Type', 'Count')

# Optional Customization
plt.title('Distribution of Car Types')
plt.show()

This visualization method is invaluable for quickly grasping the distribution and comparison of categorical data.

The `barh` method has a few arguments, but two important ones are the first two.

* The first argument `column_for_categories` specifies which column in the table to use for the categorical values. In this case, `column_for_categories='Car Type'`.
* The second argument `select` specifies what values to use as the length of the bars. If you don't specify this argument, then it will try to use columns with numerical data to generate the bars. In this case, `select='Count'`.

There are other arguments to experiment with, but you will typically just need to work with these two.


### Using `group`

It is common for data to be stored in a way such that each line of the data represents one observation, so as an analyst, you usually need to create a table like `car_inventory` that has summary information. One tool to help with this creation is the `group` table method.

`tbl.group(col_name)` creates a new table from `tbl`, where each row corresponds to a unique value from the specified column. By default, the `group` method also adds a column to the new table that shows the count of occurrences for each unique value in `tbl`.

For example, the following table `customers` shows purchases for various customers where each line shows a single transaction.

In [None]:
customers = Table().with_columns(
    'Customer ID', [101, 102, 101, 104, 105, 106, 107, 
                    108, 109, 110, 101, 105, 109, 106, 102],
    'Purchase', ['Laptop', 'Phone', 'Tablet', 'Laptop', 
                 'Phone', 'Tablet', 'Headphones', 'Laptop', 
                 'Phone', 'Headphones', 'Phone', 'Laptop', 
                 'Tablet', 'Laptop', 'Headphones'],
    'Amount ($)', [1200, 800, 400, 1300, 850, 420, 150, 
                   1100, 900, 200, 700, 1250, 450, 1150, 180]
)
customers

Customer 101 has made three purchases. Using `group` and `sort`, it becomes easier to see this. Run the following cell to see the grouped table.

In [None]:
customers_by_ID = customers.group('Customer ID').sort('count', True)
customers_by_ID 

With the data organized in this way, you can visualize the customer purchase counts. Run the following code cell to see the results.

In [None]:
customers_by_ID.barh('Customer ID')

plt.title('Customer Purchase Counts')
plt.show()

---

### `movies_by_year` Table

The `top_movies_1995_2022.csv` dataset has information about movie sales in recent years. Run the following cell to load that data as the Table `movies_by_year`.

In [None]:
movies_by_year = Table.read_table("top_movies_1995_2022.csv")
movies_by_year

---

### Task 12 📍🔎

<!-- BEGIN QUESTION -->

Using the [`movies_by_year` table](#movies_by_year-Table), create a bar chart showing the distribution of the movie distributors where the length of the bar reflects the number of movies in the table for the given distributor. Your bars should be sorted such that the longest bar is at the top of the graphic.

_Make sure to check your visualization with a classmate, a tutor, or the instructor before moving on since there is no auto-grader for this lab task._

In [None]:
# Generate your chart in this cell
movies_by_distributor = movies_by_year.group(...).sort(...)
...

plt.title('Movie Distributor Counts')
plt.show()

<!-- END QUESTION -->

---

## Histograms

When it comes to visualizing numerical data, one of the default and fundamental techniques is to use a histogram. Histograms provide a graphical representation of the distribution of numerical values within a dataset, allowing you to observe patterns, central tendencies, and variations. 

In the `datascience` library, you can create histograms conveniently using the `hist` table method. 

When working with histograms, it's essential to consider the choice of bin sizes or intervals, as this can impact the interpretation of the data. 

The `hist` method defaults to displaying data density on the vertical axis rather than raw counts. This means that the height of each bar in the histogram represents the density of data within that bin, and the area of the bar, not the count, reflects the amount of data. This distinction is crucial for accurately understanding the distribution of numerical data and is a core concept in data visualization and analysis.
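
The arithmetic behind a density scale can be sketched in plain Python. This is an illustration with made-up values and bins, assuming the convention from the referenced textbook that height is measured in percent per unit; it is not the code that `hist` runs.

```python
values = [1, 2, 2, 3, 7]
bin_edges = [0, 4, 10]  # two bins of unequal width: [0, 4) and [4, 10)

heights = []
areas = []
for left, right in zip(bin_edges[:-1], bin_edges[1:]):
    count = sum(1 for v in values if left <= v < right)
    percent = 100 * count / len(values)  # share of the data in this bin
    width = right - left
    heights.append(percent / width)      # height: percent per unit
    areas.append(percent)                # area = height * width = percent

print(heights)
print(areas)  # [80.0, 20.0]
```

The second bin holds 20% of the data, but because it is 6 units wide its bar is only about 3.3 percent per unit tall. The areas of the bars, not their heights, add up to 100.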

Run the following code cell to create a table called `ages` containing the ages of a group of people.

In [None]:
ages = Table().with_column(
    'Age', [12, 15, 18, 20, 22, 25, 26, 28, 30, 32, 35, 36, 38, 40, 45, 50, 55]
)
ages

To visualize this distribution, you can use the code `ages.hist('Age')`. `hist` has several arguments, but the first argument `columns` identifies the column(s) that contain the numerical data for the histogram.

In [None]:
ages.hist('Age')

# Optional Customization
plt.title('Distribution of Ages')
plt.show()

The `hist` table method offers two other arguments that are worth mentioning for this class: `bins` and `unit`. These arguments play a pivotal role in customizing the appearance and interpretation of the histogram. 

* The `bins` argument allows you to specify the number of bins or intervals into which the data range will be divided. A well-chosen number of bins can significantly affect the visual representation of the data, influencing the granularity of the histogram. By default, there is an algorithm that attempts to generate "good" bins, but you might need to specifically define the bins with an array or the number of bins with an integer to get the histogram to look good for your situation.
* The `unit` argument provides a way to provide labels to the horizontal and vertical axes as a reminder of what the units of the data are.

The ages are most likely measured in years and it might make sense to bin these ages by creating bins that are 10 years wide. You can achieve this with the parameters `unit="Years"` and `bins=np.arange(10, 61, 10)`. 

Run the following code cell to see the results.

In [None]:
ages.hist('Age', unit="Years", bins=np.arange(10, 61, 10))

# Optional Customization
plt.title('Distribution of Ages')
plt.show()

Notice how the shape of the histogram changes! The same numerical data can look very different in a histogram depending on how it is binned. This can be used as a tool for analysis and inquiry, but it can also be used as a tool to misguide.
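
To see what the bin edges look like and how the choice of bins changes the counts, here is a sketch that uses `np.histogram` on the same age values. It assumes `hist` assigns values to bins the same way `np.histogram` does (each bin includes its left edge, and the last bin also includes its right edge). The name `age_values` is made up so that it does not clash with the `ages` table.

```python
import numpy as np

age_values = [12, 15, 18, 20, 22, 25, 26, 28, 30,
              32, 35, 36, 38, 40, 45, 50, 55]

# The stop value 61 is excluded, so the last edge produced is 60.
edges = np.arange(10, 61, 10)
print(edges.tolist())   # [10, 20, 30, 40, 50, 60]

counts, _ = np.histogram(age_values, bins=edges)
print(counts.tolist())  # [3, 5, 5, 2, 2]

# Wider bins (25 years each) summarize the same data very differently.
wide_edges = np.arange(10, 61, 25)
wide_counts, _ = np.histogram(age_values, bins=wide_edges)
print(wide_edges.tolist())   # [10, 35, 60]
print(wide_counts.tolist())  # [10, 7]
```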

---

### Task 13 📍🔎

<!-- BEGIN QUESTION -->

Use the `hist` method with the [`compensation` table](#compensation-Table) to show the distribution of CEO total pay amounts. The default settings for `hist` make a reasonable graphic, but you can try adjusting the `bins` parameter to see what happens.

_Make sure to check your visualization with a classmate, a tutor, or the instructor before moving on since there is no auto-grader for this lab task._

In [None]:
# Generate your chart in this cell
...

plt.title('Distribution of CEO Total Pay')
plt.show()

<!-- END QUESTION -->

---

## Line Plots and Scatter Plots

When it comes to visualizing numerical relationships in data, scatter plots and line plots are two fundamental tools that provide valuable insights. 

In the `datascience` library, the `scatter` method is used to create scatter plots, while the `plot` method is employed to generate line plots. 

These visualization techniques share a conceptual similarity: both display data points on a two-dimensional plane, typically with one numerical variable on the x-axis and another on the y-axis. However, the key distinction lies in the purpose and interpretation.

* Scatter plots are versatile and are primarily used to showcase the association and general pattern between two numerical variables. They are excellent for revealing relationships, correlations, and outliers in the data.
* Line plots are best suited when the horizontal axis represents sequential data, such as time or distance. These plots connect the data points with lines, making them ideal for visualizing trends and showing how a numerical variable changes over a continuous range, as in the case of tracking revenue over time.

In essence, scatter plots excel at depicting associations, while line plots are tailor-made for illustrating trends and sequential relationships in numerical data.

For example, run the following code cell to generate a table called `company_data` showing revenue and profit data for the last ten years for some **fake** company.

In [None]:
# Generate random data
np.random.seed(0)
years = np.arange(2014, 2024)
revenue = np.random.poisson(14, 10) * 2_500
profit = revenue * np.random.normal(0.08, 0.002, 10)

company_data = Table().with_columns('Year', years, 'Revenue', revenue, 'Profit', profit)
company_data

A line plot would be a standard choice to visualize the trend of profit over time. This can be done with the command `company_data.plot('Year', 'Profit')`.

Run the following code cell to see the results.

In [None]:
company_data.plot('Year', 'Profit')

# Optional Customization
plt.title('Profits Over Time')
plt.show()

The line shows that something significant happened around 2018 that made the company very profitable. After a short period of high profits, profit declined steeply to levels lower than in the years before 2018. For the last few years, the company's profits have lacked stability. This is likely due to economic instability surrounding the pandemic, but all this profit data was made up and so is money. 🤓

It might be nice to compare the trends of two numerical distributions over the same horizontal axis. This would be a great time to try an overlaid line plot. For example, you could plot the lines for both profit and revenue over time.

The `plot` method can handle this. Make sure the table contains only the variables you are interested in (`'Year'`, `'Profit'`, and `'Revenue'`), and specify only the horizontal axis when calling `plot`. For example, use `company_data.plot('Year')`.

Run the following cell to see that a line is created for every numerical column in the table other than `'Year'`.

In [None]:
company_data.plot('Year')

# Optional Customization
plt.title('Revenue and Profits Over Time')
plt.show()

Revenue looks much less stable than profit on this graph because of the scale of the values. Profits hovered around 8% of revenue, so plotting both lines against the same y-axis doesn't offer a fair comparison.
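
One way to check the point about scale is to compare the spread of the two columns directly. The sketch below regenerates the same made-up arrays, using the same seed and the same order of random calls as the cell that built `company_data`, and compares their ranges. The names `revenue_range` and `profit_range` are introduced here only for illustration.

```python
import numpy as np

# Recreate the fake company data with the same seed and call order as above
np.random.seed(0)
revenue = np.random.poisson(14, 10) * 2_500
profit = revenue * np.random.normal(0.08, 0.002, 10)

# Range (max minus min) of each series, in the same units as the y-axis
revenue_range = revenue.max() - revenue.min()
profit_range = profit.max() - profit.min()

print('Revenue range:', revenue_range)
print('Profit range:', round(profit_range, 2))
print('Profit range as a share of revenue range:', round(profit_range / revenue_range, 3))
```

Since profit is roughly 8% of revenue, its range should be a small fraction of the revenue range, which is why the profit line looks nearly flat when both lines share one y-axis.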

How can you better see the relationship between revenue and profit? Since the data are not sequential and you are just looking to visualize the association, use a scatter plot with the `scatter` method. Since profit follows from revenue, it is standard practice to have the horizontal axis reflect revenue values. Use the command `company_data.scatter('Revenue', 'Profit')` to make this happen.

Run the following cell to see the results.

In [None]:
company_data.scatter('Revenue', 'Profit')

# Optional Customization
plt.title('Revenue vs. Profit')
plt.show()

This shows a pretty strong (linear) positive relationship between revenue and profit. Dividing the profit values by the revenue shows a pretty stable profit margin of roughly 8%.

In [None]:
profit_margins = company_data.column('Profit') / company_data.column('Revenue')
print('The profit margins (profit/revenue) are:\n')
display(profit_margins)
average_profit_margin = "{:.2%}".format(np.average(profit_margins))
print(f'\nThe average profit margin over this period is {average_profit_margin}')
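
If you want a single number to go with the visual impression of a strong linear relationship, one common summary is the correlation coefficient. The optional sketch below is not part of the lab tasks. It regenerates the same made-up arrays with the same seed and uses NumPy's `np.corrcoef`; the variable name `r` is chosen here for illustration.

```python
import numpy as np

# Recreate the fake company data with the same seed and call order as above
np.random.seed(0)
revenue = np.random.poisson(14, 10) * 2_500
profit = revenue * np.random.normal(0.08, 0.002, 10)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation between the two arrays
r = np.corrcoef(revenue, profit)[0, 1]
print(f'Correlation between revenue and profit: {r:.3f}')
```

A value close to 1 is consistent with the tight upward pattern in the scatter plot.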

---

### Task 14 📍🔎

<!-- BEGIN QUESTION -->

For this task, return to the [`movies_by_year` Table](#movies_by_year-Table) and create a line plot using the `plot` table method to visualize the trend of the number of tickets sold over time. What do you notice about the line graph in terms of trends in the number of tickets sold over time?

_Make sure to check your visualization with a classmate, a tutor, or the instructor before moving on since there is no auto-grader for this lab task._

_Type your answer here, replacing this text._

In [None]:
# Generate your chart in this cell
...

# Customization
plt.title('Tickets Sold Over Time')
plt.show()

<!-- END QUESTION -->

---

### Task 15 📍🔎

<!-- BEGIN QUESTION -->

For this task, return to the [`compensation` table](#compensation-Table). Create a scatter plot showing the relationship between the total pay for a CEO and the ratio of the CEO's pay to the average industry worker's pay. What do you notice?

_Make sure to check your response with a classmate, a tutor, or the instructor before moving on since there is no auto-grader for this lab task._

_Type your answer here, replacing this text._

In [None]:
# Generate your chart in this cell
# We've shortened the name of the Ratio column label.
compensation = compensation.relabeled('Ratio of CEO pay to average industry worker pay', 'Ratio')
...

# Customization
plt.title('Total Pay vs. Ratio')
plt.show()

<!-- END QUESTION -->

Great work! You now have some practice with the basics of data visualization.

---

## Submit Your Assignment to Canvas

Follow these steps to submit your lab assignment:

1. **Check the Assignment Completion Requirements:** This assignment is scored as Complete or Incomplete. Make sure to check with your instructor about their requirements for a Complete score. 
2. **Run the Auto-Grader:** Ensure you have executed the code cell containing the command `grader.check_all()` to run all tests for auto-graded tasks marked with 📍. This command will execute all auto-grader tests sequentially.
3. **Complete Manually Graded Tasks:** Verify that you have responded to all the manually graded tasks marked with 📍🔎.
4. **Save Your Work:** In the notebook's Toolbar, go to `File -> Save Notebook` to save your work and create a checkpoint.
5. **Download the Notebook:** In the notebook's Toolbar, go to `File -> Download HTML` to download the HTML version (`.html`) of this notebook.
6. **Upload to Canvas:** On the Canvas Assignment page, click "Start Assignment" or "New Attempt" to upload the downloaded `.html` file.

---

## Attribution

This content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a> and derived from the <a href="https://www.data8.org/">Data 8: The Foundations of Data Science</a> offered by the University of California, Berkeley.

<img src="./by-nc-sa.png" width=100px>

---

To double-check your work, the cell below will rerun all of the auto-grader tests.

In [None]:
grader.check_all()