# Week 05 - Assignment

## Arrays

Computers are most useful when you can use a small amount of code to *do the same action* to *many different things*.

For example, in the time it takes you to calculate the 18% tip on a restaurant bill, a laptop can calculate 18% tips for every restaurant bill paid by every human on Earth that day.  (That's if you're pretty fast at doing arithmetic in your head!)

**Arrays** are how we put many values in one place so that we can operate on them as a group. For example, if `billions_of_numbers` is an array of numbers, the expression

    .18 * billions_of_numbers

gives a new array of numbers that contains the result of multiplying **each number** in `billions_of_numbers` by .18.  Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.

Concretely, an array is a **collection of values of the same type**. 

### Making Arrays

First, let's learn how to manually input values into an array. This typically isn't how programs work. Normally, we create arrays by loading them from an external source, like a data file.

To create an array by hand, call the function `np.array`.  Each argument you pass to `np.array` will be in the array it returns.  Run this cell to see an example:

In [None]:
# https://numpy.org/doc/stable/user/basics.creation.html
import numpy as np

my_array = np.array([0.125, 4.75, -1.3])
my_array

Each value in an array (in the above case, the numbers 0.125, 4.75, and -1.3) is called an *element* of that array.

Arrays themselves are also values, just like numbers and strings.  That means you can assign them to names or use them as arguments to functions. For example, `len(<some_array>)` returns the number of elements in `some_array`.
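As a small illustration of arrays being values, the sketch below assigns an array to a name, passes it to `len`, and multiplies it by a number. The name `sample_bills` and its values are made up for this example.

```python
import numpy as np

# Hypothetical values chosen only for illustration.
sample_bills = np.array([10.0, 25.0, 40.0])

# An array can be passed to a function like any other value.
number_of_bills = len(sample_bills)

# Multiplying an array by a number multiplies every element.
sample_tips = 0.18 * sample_bills

print(number_of_bills)
print(sample_tips)
```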

Make an array containing the numbers 0, 1, -1, and $\pi$, in that order.  Name it `interesting_numbers`.  

*Hint:* Remember how we got the value $\pi$?  Use the `math` module.

In [None]:
# import numpy module (use appropriate alias for numpy) and math module
... # replace this ellipsis with the appropriate import statement as with similarly placed ellipses
...

interesting_numbers = ...
interesting_numbers

Make an array containing the five strings "Hello", ",", " ", "world", and "!". (The third one is a single space inside quotes.) Name it `hello_world_components`.

Note: If you evaluate `hello_world_components`, you'll notice some extra information in addition to its contents: `dtype='<U5'`. That's just NumPy's extremely cryptic way of saying that the elements of the array are strings.

In [None]:
hello_world_components = ...
hello_world_components
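To make the `dtype` note more concrete, here is a minimal sketch using made-up arrays (the names `words` and `mixed` are not part of the assignment). It suggests that NumPy records one data type for the whole array, and that mixing integers and floats produces an array of floats.

```python
import numpy as np

words = np.array(["NumPy", "is", "fun"])
# Typically prints <U5: Unicode strings of at most 5 characters.
print(words.dtype)

mixed = np.array([1, 2.5, 3])
# The integers are converted so that every element has the same type.
print(mixed.dtype)
print(mixed)
```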

###  `np.arange`

Arrays are provided by a package called [NumPy](http://www.numpy.org/) (pronounced "NUM-pie"). The package is called `numpy`, but it's standard to rename it `np` for brevity.  You can do that with:

    import numpy as np

Very often in data science, we want to work with many numbers that are evenly spaced within some range.  NumPy provides a special function for this called `arange`.  The line of code `np.arange(start, stop, step)` evaluates to an array with all the numbers starting at `start` and counting up by `step`, stopping **before** `stop` is reached.

Run the following cells to see some examples!

In [None]:
# This array starts at 1 and counts up by 2
# and then stops before 6
np.arange(1, 6, 2)

In [None]:
# This array doesn't contain 9
# because np.arange stops *before* the stop value is reached
np.arange(4, 9, 1)

Import `numpy` as `np`, then use `np.arange` to create an array with the multiples of 99 from 0 up to (and including) 9999. (So its elements are 0, 99, 198, 297, etc.)

In [None]:
...

multiples_of_99 = ...
multiples_of_99

### Working with Single Elements of Arrays ("indexing")

Let's work with a more interesting dataset.  The next cell loads a table (a pandas DataFrame) called `population_amounts` that holds estimated world populations for every year from **1950** to roughly the present.  (The estimates come from the US Census Bureau website.)

In [None]:
import pandas as pd

population_amounts = pd.read_csv('https://raw.githubusercontent.com/data-8/materials-sp22/main/materials/sp22/lab/lab03/world_population.csv')
population_amounts.head()

Here's how we get the first element of `population_amounts`, which is the world population in the first year in the dataset, 1950.

In [None]:
# get the population_amounts at row index 0
population_amounts.iloc[0]

That expression returns the first row of `population_amounts`, which holds the number 2,557,628,654 (around 2.5 billion).

Notice that we wrote `.iloc[0]`, not `.iloc[1]`, to get the first element.  This is a common convention in computer science that can feel strange at first.  **0 is called the *index* of the first item.**  It's the number of elements that appear *before* that item.  So 3 is the index of the 4th item.

For more on the difference between `.loc` and `.iloc`, read: https://www.statology.org/pandas-loc-vs-iloc/#:~:text=When%20it%20comes%20to%20selecting,columns%20at%20specific%20integer%20positions
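Zero-based positions can be checked on a tiny example. The sketch below assumes a made-up one-column table (the column name `Population` and the values are hypothetical, not taken from the real dataset). It also shows a detail worth knowing: `.iloc[0]` on a whole DataFrame gives back the first *row*, while `.iloc[0]` on a single column gives back a single value.

```python
import pandas as pd

# Hypothetical miniature table for illustration only.
tiny = pd.DataFrame({"Population": [100, 110, 125, 140]})

first_row = tiny.iloc[0]                   # a Series holding the first row
first_value = tiny["Population"].iloc[0]   # index 0 is the 1st item
fourth_value = tiny["Population"].iloc[3]  # index 3 is the 4th item

print(first_row)
print(first_value, fourth_value)
```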

Here are some more examples.  In the examples, we've given names to the things we get out of `population_amounts`.  Read and run each cell.

In [None]:
# The 13th element in the array is the population
# in 1962 (which is 1950 + 12).
population_1962 = population_amounts.iloc[12]
population_1962

In [None]:
# The 66th element is the population in 2015.
population_2015 = ...
population_2015

Since `np.array` returns an array, we can write `[3]` directly after the call to get the 4th element of the result, just like we "chained" together calls to the method `replace` earlier.

In [None]:
np.array([-1, -3, 4, -2])[3]

Set population_1973 to the world population in 1973, by getting the appropriate element from population_amounts using `iloc`.

In [None]:
population_1973 = ...
population_1973

### Doing something to every element of an array

Arrays are primarily useful for doing the same operation many times, so we don't often have to use `.iloc` and work with single elements.

#### Rounding
Here is one simple question we might ask about world population:

> How big was the population in each year, rounded to the nearest million?

Rounding is often used with large numbers when we don't need as much precision in our numbers. One example of this is when we present data in tables and visualizations. 

We could try to answer our question using the `round` function that is built into Python and the `.iloc` indexer you just saw. 

**Note:** the `round` function takes in two arguments: the number to be rounded, and the number of decimal places to round to. The second argument can be thought of as how many steps right or left you move from the decimal point. Negative numbers tell us to move left, and positive numbers tell us to move right. So, if we have `round(1234.5, -2)`, it means that we should move two places left, and then make all numbers to the right of this place zeroes. This would output the number 1200.0. On the other hand, if we have `round(6.789, 1)`, we should move one place right, and then make all numbers to the right of this place zeroes. This would output the number 6.8.
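The two calls described in the note can be run directly. The sketch below also includes one extra call, `round(2.5)`, to show a behavior that sometimes surprises people: Python rounds an exact half to the nearest even number. The variable names are made up for this illustration.

```python
# Negative second argument: round to the left of the decimal point.
hundreds = round(1234.5, -2)

# Positive second argument: round to the right of the decimal point.
tenths = round(6.789, 1)

# An exact half is rounded to the nearest even integer, so this is 2, not 3.
half = round(2.5)

print(hundreds, tenths, half)
```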

In [None]:
population_1950_magnitude = round(population_amounts.iloc[0], -6)
population_1951_magnitude = round(population_amounts.iloc[1], -6)
population_1952_magnitude = round(population_amounts.iloc[2], -6)
population_1953_magnitude = round(population_amounts.iloc[3], -6)

But this is tedious and doesn't really take advantage of the fact that we are using a computer.

Instead, NumPy provides its own version of `round` that rounds each element of an array.  It takes in two arguments: a single array of numbers, and the number of decimal places to round to.  It returns an array of the same length, where the first element of the result is the first element of the argument rounded, and so on.

Use `np.round` to compute the world population in every year, rounded to the nearest million (6 zeroes).  Give the result (an array of 66 numbers) the name `population_rounded`.  Your code should be very short.

In [None]:
...

population_rounded = ...
population_rounded

What you just did is called **elementwise** application of `np.round`, since `np.round` operates separately on each element of the array that it's called on. Here's a picture of what's going on:

<img src="https://raw.githubusercontent.com/data-8/materials-sp22/main/materials/sp22/lab/lab03/array_round.jpg">

The textbook's [section](https://www.inferentialthinking.com/chapters/05/1/Arrays)  on arrays has a useful list of NumPy functions that are designed to work elementwise, like `np.round`.
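As a rough illustration of elementwise functions, the sketch below applies `np.round`, `np.sqrt`, and `np.abs` to small made-up arrays (the names and values are hypothetical). Each call returns a new array of the same length as its input.

```python
import numpy as np

measurements = np.array([1.26, 2.44, 9.0])

# Each function is applied to every element separately.
rounded = np.round(measurements, 1)          # one decimal place per element
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))   # square root of each element
magnitudes = np.abs(np.array([-3, 0, 5]))    # absolute value of each element

print(rounded)
print(roots)
print(magnitudes)
```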

### Arithmetic

Arithmetic also works elementwise on arrays, meaning that if you perform an arithmetic operation (like subtraction, division, etc) on an array, Python will do the operation to every element of the array individually and return an array of all of the results. For example, you can divide all the population numbers by 1 billion to get numbers in billions:

In [None]:
population_in_billions = population_amounts / 1000000000
population_in_billions

In [None]:
# more array functions
restaurant_bills = np.array([20.12, 39.90, 31.01])
print("Restaurant bills:\t", restaurant_bills)

# Array multiplication
tips = .2 * restaurant_bills
print("Tips:\t\t\t", tips)

<img src="https://raw.githubusercontent.com/data-8/materials-sp22/main/materials/sp22/lab/lab03/array_multiplication.jpg">
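Arithmetic between two arrays of the same length also works elementwise: the first elements are combined, then the second elements, and so on. A minimal sketch with made-up names and values (`unit_prices` and `quantities` are not part of the assignment):

```python
import numpy as np

unit_prices = np.array([2.5, 4.0, 10.0])
quantities = np.array([4, 1, 3])

# Each price is multiplied by the quantity in the same position.
line_totals = unit_prices * quantities

print(line_totals)
print(line_totals.sum())
```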

Suppose the total charge at a restaurant is the original bill plus the tip. If the tip is 20%, that means we can multiply the original bill by 1.2 to get the total charge.  Compute the total charge for each bill in `restaurant_bills`, and assign the resulting array to `total_charges`.

In [None]:
total_charges = ...
total_charges

## Functions

### Defining functions

Let's write a very simple function that converts a proportion to a percentage by multiplying it by 100.  For example, the value of `to_percentage(.5)` should be the number 50 (no percent sign).

A function definition has a few parts.

#### `def`

A function definition always starts with the keyword `def` (short for **def**ine).

#### `name`

Next comes the name of the function.  Like other names we've defined, it can't start with a number or contain spaces. Let's call our function `to_percentage`:
    
    def to_percentage

#### `signature`

Next comes something called the *signature* of the function.  This tells Python how many arguments your function should have, and what names you'll use to refer to those arguments in the function's code.  A function can have any number of arguments (including 0!). 

`to_percentage` should take one argument, and we'll call that argument `proportion` since it should be a proportion.

    def to_percentage(proportion)
    
If we want our function to take more than one argument, we add a comma between each argument name. Note that if we had zero arguments, we'd still place the parentheses () after that name. 

We put a **colon** after the signature to tell Python that the next indented lines are the body of the function. If you're getting a syntax error after defining a function, check to make sure you remembered the colon!

    def to_percentage(proportion):

#### `documentation`

Functions can do complicated things, so you should write an explanation of what your function does.  For small functions, this is less important, but it's a good habit to learn from the start.  Conventionally, Python functions are documented by writing an **indented** triple-quoted string:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        
#### `body`
Now we start writing code that runs when the function is called.  This is called the *body* of the function and every line **must be indented** by the same amount (four spaces is the usual convention).  Any line that is *not* indented, and is instead left-aligned with the `def` statement, is considered outside the function. 

Some notes about the body of the function:
- We can write code that we would write anywhere else.  
- We use the arguments defined in the function signature. We can do this because we assume that when we call the function, values are already assigned to those arguments.
- We generally avoid referencing variables defined *outside* the function. If you would like to reference variables outside of the function, pass them through as arguments!


Now, let's give a name to the number we multiply a proportion by to get a percentage:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
        
#### `return`
The special instruction `return` is part of the function's body and tells Python to make the value of the function call equal to whatever comes right after `return`.  We want the value of `to_percentage(.5)` to be the proportion .5 times the factor 100, so we write:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
        return proportion * factor
        
`return` only makes sense in the context of a function, and **can never be used outside of a function**. A `return` statement is usually the last line of a function, because Python stops executing the body as soon as it reaches one. If a function does not have a `return` statement, calling it gives back the special value `None`; if you expect a useful value back from the function, make sure to include a `return` statement. 

*Note:*  `return` inside a function tells Python what value the function evaluates to. However, there are other functions, like `print`, that have no `return` value. For example, `print` simply prints a certain value out to the console. 

In short, `return` is used when you want to tell the *computer* what the value of some variable is, while `print` is used to tell you, a *human*, its value.        
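The difference can be seen in a short sketch. The function names below (`double_and_return`, `double_and_print`) are invented for this illustration. The first hands its result back to the caller; the second only displays it, so the caller receives the special value `None`.

```python
def double_and_return(x):
    """Returns twice x."""
    return 2 * x
    print("this line never runs")  # Python leaves the function at `return`

def double_and_print(x):
    """Prints twice x, but has no return statement."""
    print(2 * x)

returned = double_and_return(4)   # the value 8 comes back to the caller
printed = double_and_print(4)     # 8 is displayed, but nothing comes back

print(returned + 1)
print(printed is None)
```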

Define `to_percentage` in the cell below.  Call your function to convert the proportion .2 to a percentage.  Name that percentage `twenty_percent`.

In [None]:
def to_percentage():
    ''' ... '''
    factor = ...
    return ...

twenty_percent = ...
twenty_percent

Here's something important about functions: the names assigned *within* a function body are only accessible within the function body. Once the function has returned, those names are gone.  So even if you created a variable called `factor` and defined `factor = 100` inside of the body of the `to_percentage` function and then called `to_percentage`, `factor` would not have a value assigned to it outside of the body of `to_percentage`.
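This can be checked with a small experiment. The sketch below uses a hypothetical function `to_permille` (not part of the assignment) whose body assigns the name `scale`. Asking for `scale` afterwards raises a `NameError`, which the code catches so that the cell still runs to the end.

```python
def to_permille(proportion):
    """Converts a proportion to parts per thousand."""
    scale = 1000          # this name exists only while the function runs
    return proportion * scale

result = to_permille(0.25)

try:
    scale                 # the name is not defined outside the function
    scale_is_visible = True
except NameError:
    scale_is_visible = False

print(result, scale_is_visible)
```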

### `print` is not the same as `return`
A function that prints a value is not the same as one that returns it. If a function only prints its result and never returns it, we cannot use that result after calling the function. Let's look at an example of a function that prints a value but does not return it.

In [None]:
def print_number_five():
    print(5)

In [None]:
print_number_five()

However, if we try to use the output of `print_number_five()`, we see that the value `5` is printed but we get a TypeError when we try to add the number 2 to it!

In [None]:
print_number_five_output = print_number_five()
print_number_five_output + 2

It may seem that `print_number_five()` is returning a value, 5. In reality, it just displays the number 5 to you without giving you the actual value! If your function prints out a value **without returning it** and you try to use that value, you will run into errors, so be careful!

Explain to your neighbor or a staff member how you might add a line of code to the `print_number_five` function (after `print(5)`) so that the code `print_number_five_output + 5` would result in the value `10`, rather than an error.

### Functions and CEO Incomes

In this question, we'll look at the 2015 compensation of CEOs at the 100 largest companies in California. The data was compiled from a [Los Angeles Times analysis](http://spreadsheets.latimes.com/california-ceo-compensation/), and ultimately came from [filings](https://www.sec.gov/answers/proxyhtf.htm) mandated by the SEC from all publicly-traded companies. Two companies have two CEOs, so there are 102 CEOs in the dataset.

We've copied the raw data from the LA Times page into a file called `raw_compensation.csv`. (The page notes that all dollar amounts are in **millions of dollars**.)

In [None]:
import pandas as pd

raw_compensation = pd.read_csv('https://raw.githubusercontent.com/data-8/materials-sp22/main/materials/sp22/lab/lab04/raw_compensation.csv')
raw_compensation.head()

We want to compute the average of the CEOs' pay. Try running the cell below.

In [None]:
import numpy as np

raw_compensation['Total Pay'].mean()

The problem is that each value in `Total Pay` has a dollar sign and is therefore stored as a string. Remove the dollar sign from `Total Pay` and convert the column to floats. Then output the mean of `Total Pay`.

In [None]:
# try a one liner without using a function
raw_compensation['Total Pay'] = ...
# raw_compensation['Total Pay'].mean()

### `apply`ing functions

In [None]:
import pandas as pd

raw_compensation = pd.read_csv('https://raw.githubusercontent.com/data-8/materials-sp22/main/materials/sp22/lab/lab04/raw_compensation.csv')
raw_compensation.head()

# define a function that strips the $ sign from a row's Total Pay and converts it to a float
def dollar_to_float(row):
    return float(row['Total Pay'].replace('$', ''))

raw_compensation['Total Pay'] = raw_compensation.apply(dollar_to_float, axis=1)
raw_compensation['Total Pay'].mean()
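In the cell above, `apply` with `axis=1` calls `dollar_to_float` once per row and collects the results into a new column. A minimal sketch of the same idea on a made-up table follows (the values and the helper names `strip_dollar` and `row_to_float` are hypothetical). It also shows the column-level form, `Series.apply`, which passes one cell value at a time instead of a whole row.

```python
import pandas as pd

# Hypothetical miniature table; the values are not from the real dataset.
pay = pd.DataFrame({"Name": ["A", "B", "C"],
                    "Total Pay": ["$1.50", "$2.25", "$4.00"]})

def strip_dollar(text):
    """Converts a string like '$1.50' to the float 1.5."""
    return float(text.replace("$", ""))

def row_to_float(row):
    """Picks the Total Pay cell out of a row and converts it."""
    return strip_dollar(row["Total Pay"])

# Row-wise: the function receives each row of the table.
by_row = pay.apply(row_to_float, axis=1)

# Column-wise: the function receives each value in the column directly.
by_value = pay["Total Pay"].apply(strip_dollar)

print(by_row.tolist())
print(by_value.mean())
```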

### Histograms

Earlier, we computed the average pay among the CEOs in our 102-CEO dataset.  The average doesn't tell us everything about the amounts CEOs are paid, though.  Maybe just a few CEOs make the bulk of the money, even among these 102.

We can use a *histogram* to display the *distribution* of a set of numbers.  The pandas DataFrame method `hist` accepts a `column` argument, the name of a column of numbers, and produces a histogram of the numbers in that column.
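To see what a histogram summarizes, it can help to compute the bin counts directly. The sketch below uses NumPy's `np.histogram` on made-up numbers rather than the pandas `hist` method, so it shows the counting step only and draws nothing. The name `sample_pay` and its values are hypothetical.

```python
import numpy as np

sample_pay = np.array([1, 2, 2, 3, 10])

# Split the range from 1 to 10 into 3 equal-width bins
# and count how many values fall into each bin.
counts, bin_edges = np.histogram(sample_pay, bins=3)

print(bin_edges)   # the boundaries of the bins
print(counts)      # most of the values land in the first bin
```

A distribution like this one, with most values in the lowest bin and one value far to the right, is the kind of shape a histogram makes easy to spot.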

Make a histogram of the total pay of the CEOs in `Total Pay`. 

In [None]:
# using pandas (not matplotlib) create a histogram of Total Pay

## Create a Chart

From the examples below, create a chart and explain the chart in 180 - 250 words. Many of the charts come with their own datasets and are self-explanatory, so choose a chart that is appropriate for your skill level.

* https://matplotlib.org/stable/gallery/index.html 
* http://seaborn.pydata.org/examples/ 
* https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html 

In [None]:
# code for creating chart
...

Explanation of the chart in 180 - 250 words