In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab_2.ipynb")

# Lab 2: Data Types, Arrays, and Table operations

Now that you've made it through the first bit of classes, welcome to Lab 2!  

#### Today's lab:

In today's lab, you'll go over:
* a review of some of the Python building blocks we went over last week
* how to import a module in Python
* practice with table operations
* basic analysis of our table data

Recommended Reading:
 * [Introduction to Tables](https://www.inferentialthinking.com/chapters/03/4/Introduction_to_Tables)
 
Recommended Videos:
 * Intro to Python
 * Intro to Tables

First, set up the tests and imports by running the cell below.

In [None]:
# Just run this cell

import numpy as np
import math
from datascience import *



## 1. Review: The building blocks of Python code

If any of this terminology is still feeling a little fresh, feel free to refer back to the demo videos and relevant textbook chapters to help you work through the assignment! And of course, reach out to a TA if you're feeling stuck. 

Remember that the two building blocks of Python code are *expressions* and *statements*.  An **expression** is a piece of code that

* is self-contained, meaning it would make sense to write it on a line by itself, and
* usually evaluates to a value.


Here are two expressions that both evaluate to 3:

    3
    5 - 2
    
One important type of expression is the **call expression** (sometimes also referred to as a function call). A call expression begins with the name of a function and is followed by the argument(s) of that function in parentheses. The function returns some value, based on its arguments. Some important mathematical functions are listed below.

| Function | Description                                                   |
|----------|---------------------------------------------------------------|
| `abs`      | Returns the absolute value of its argument                    |
| `max`      | Returns the maximum of all its arguments                      |
| `min`      | Returns the minimum of all its arguments                      |
| `pow`      | Raises its first argument to the power of its second argument |
| `round`    | Rounds its argument to the nearest integer                     |

Here are two call expressions that both evaluate to 3:

    abs(2 - 5)
    max(round(2.8), min(pow(2, 10), -1 * pow(2, 10)))

The expression `2 - 5` and the two call expressions given above are examples of **compound expressions**, meaning that they are actually combinations of several smaller expressions.  `2 - 5` combines the expressions `2` and `5` by subtraction.  In this case, `2` and `5` are called **subexpressions** because they're expressions that are part of a larger expression. 
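The second call expression above can be unpacked one subexpression at a time. The sketch below is only an illustration of that idea; the intermediate names (`power`, `smaller`, `rounded`, `result`) are hypothetical and not part of the lab.

```python
# Evaluate max(round(2.8), min(pow(2, 10), -1 * pow(2, 10))) from the inside out.
power = pow(2, 10)                  # 2 raised to the 10th power: 1024
smaller = min(power, -1 * power)    # the smaller of 1024 and -1024: -1024
rounded = round(2.8)                # 2.8 rounded to the nearest integer: 3
result = max(rounded, smaller)      # the larger of 3 and -1024: 3
print(result)
```

Each name holds the value of one subexpression, so the final `max` call only has to compare two plain numbers.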

A **statement** is a whole line of code.  Some statements are just expressions.  The expressions listed above are examples.

Other statements *make something happen* rather than *having a value*. For example, an **assignment statement** assigns a value to a name. 

A good way to think about this is that we're **evaluating the right-hand side** of the equals sign and **assigning it to the left-hand side**. Here are some assignment statements:
    
    height = 1.3
    the_number_five = abs(-5)
    absolute_height_difference = abs(height - 1.688)

An important idea in programming is that large, interesting things can be built by combining many simple, uninteresting things.  The key to understanding a complicated piece of code is breaking it down into its simple components.

For example, a lot is going on in the last statement above, but it's really just a combination of a few things.  This picture describes what's going on.

<img src="statement.png">

<span style="color:blue">**Question 1.0.1**</span> In the next cell, assign the name `new_year` to the larger among the following two numbers:

1. the **absolute value** of $2^{5}-2^{11}-2^{1}-3$, and 
2. $5 \times 13 \times 31 + 7$.

Try to use just one statement (one line of code). Be sure to check your work by executing the test cell afterward.

<!--
BEGIN QUESTION
name: q11
-->

In [None]:
new_year = ...
new_year

In [None]:
grader.check("q11")

We've asked you to use one line of code in the question above because it only involves mathematical operations. However, more complicated programming questions will require more steps. It isn't always a good idea to jam these steps into a single line because it can make the code harder to read and harder to debug.

Good programming practice involves splitting up your code into smaller steps and using appropriate names. You'll have plenty of practice in the rest of this course!
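As a small illustration of this advice, the sketch below computes the same value twice: once in a single line and once in named steps. It reuses the numbers from the assignment statements in Section 1; the names `average_height`, `height_difference`, `one_line_result`, and `stepwise_result` are made up for this example.

```python
# One line: correct, but the reader has to unpack everything at once.
one_line_result = abs(1.3 - 1.688)

# The same computation in named steps (all names here are hypothetical).
height = 1.3
average_height = 1.688
height_difference = height - average_height   # negative, because height is smaller
stepwise_result = abs(height_difference)       # distance between the two heights

print(one_line_result, stepwise_result)
```

Both versions give the same number, but the second one lets you inspect `height_difference` on its own if something looks wrong.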

## 2. Importing Code

![imports](https://external-preview.redd.it/ZVPjiFo_Ubl4JeiU63SaTjdIoq5zveSnNZimKpgn2I8.png?auto=webp&s=bf32c94b630befa121075c1ae99b2599af6dedc5) 

[Source](https://www.reddit.com/r/ProgrammerHumor/comments/cgtk7s/theres_no_need_to_reinvent_the_wheel_oc/)

Most programming involves work that is very similar to work that has been done before.  Since writing code is time-consuming, it's good to rely on others' published code when you can.  Rather than copy-pasting, Python allows us to **import modules**. A module is a file with Python code that has defined variables and functions. By importing a module, we are able to use its code in our own notebook.

Python includes many useful modules that are just an `import` away.  We'll look at the `math` module as a first example. The `math` module is extremely useful in computing mathematical expressions in Python. 

Suppose we want to very accurately compute the area of a circle with a radius of 5 meters.  For that, we need the constant $\pi$, which is roughly 3.14.  Conveniently, the `math` module defines `pi` for us. Here is how we use it:

In [None]:
import math
radius = 5
area_of_circle = radius**2 * math.pi
area_of_circle

In the code above, the line `import math` imports the math module. This statement brings in that module and then assigns the name `math` to that module, to use in our code. We are now able to access any variables or functions defined within `math` by typing the name of the module followed by a dot, then followed by the name of the variable or function we want. If you recall from the last assignment, this is very similar to how we used methods belonging to the `Table` type object!

    <module name>.<name>

<span style="color:blue">**Question 2.0.1**</span> The module `math` also provides the name `e` for the base of the natural logarithm, which is roughly 2.72. Compute $e^{\pi}-\pi$, giving it the name `near_twenty`.

*Remember: You can access `pi` from the `math` module as well!*

<!--
BEGIN QUESTION
name: q21
-->

In [None]:
near_twenty = ...
near_twenty

In [None]:
grader.check("q21")

![XKCD](http://imgs.xkcd.com/comics/e_to_the_pi_minus_pi.png)

[Source](http://imgs.xkcd.com/comics/e_to_the_pi_minus_pi.png)
[Explanation](https://www.explainxkcd.com/wiki/index.php/217:_e_to_the_pi_Minus_pi)

### 2.1. Accessing functions

In the question above, you accessed variables within the `math` module. 

**Modules** also define **functions**.  For example, `math` provides the name `sin` for the sine function.  Having imported `math` already, we can write `math.sin(3)` to compute the sine of 3.  (Note that this sine function considers its argument to be in [radians](https://en.wikipedia.org/wiki/Radian), not degrees.  180 degrees are equivalent to $\pi$ radians.)
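The degrees-versus-radians point can be seen in a short sketch. It assumes only Python's standard `math` module; `math.radians` is a standard-library function that this lab does not otherwise use, and the variable names are illustrative.

```python
import math

angle_in_degrees = 30
angle_in_radians = math.radians(angle_in_degrees)   # convert degrees to radians

# math.sin expects radians, so pass the converted angle.
sine_of_30_degrees = math.sin(angle_in_radians)

# 180 degrees corresponds to pi radians.
half_turn = math.radians(180)

print(angle_in_radians, sine_of_30_degrees, half_turn)
```

Passing `30` directly to `math.sin` would compute the sine of 30 radians, which is a different angle entirely.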

<span style="color:blue">**Question 2.1.1.**</span> A $\frac{\pi}{4}$-radian (45-degree) angle forms a right triangle with equal base and height, pictured below.  If the hypotenuse (the radius of the circle in the picture) is 1, then the height is $\sin(\frac{\pi}{4})$.  Compute that value using `sin` and `pi` from the `math` module.  Give the result the name `sine_of_pi_over_four`, in the next cell.

<img src="http://mathworld.wolfram.com/images/eps-gif/TrigonometryAnglesPi4_1000.gif">

[Source](http://mathworld.wolfram.com/images/eps-gif/TrigonometryAnglesPi4_1000.gif)

<!--
BEGIN QUESTION
name: q211
-->

In [None]:
sine_of_pi_over_four = ...
sine_of_pi_over_four

In [None]:
grader.check("q211")

For your reference, below are some more examples of functions from the `math` module.

Notice how different functions take in different numbers of arguments. Often, the [documentation](https://docs.python.org/3/library/math.html) of the module will provide information on how many arguments are required for each function.

*Hint: If you press `shift+tab` while next to the function call, the documentation for that function will appear!*

In [None]:
# Calculating logarithms (the logarithm of 8 in base 2).
# The result is 3 because 2 to the power of 3 is 8.
math.log(8, 2)

In [None]:
# Calculating square roots.
math.sqrt(5)

There are various ways to import and access code from outside sources. The method we used above — `import <module_name>` — imports the entire module and requires that we use `<module_name>.<name>` to access its code. 

We can also import a specific constant or function instead of the entire module. Notice that you don't have to use the module name beforehand to reference that particular value. However, you do have to be careful about reassigning the names of the constants or functions to other values!

In [None]:
# Importing just cos and pi from math.
# We don't have to use `math.` in front of cos or pi
from math import cos, pi
print(cos(pi))

# We do have to use it in front of other functions from math, though
math.log(pi)

Or we can import every function and value from the entire module using `*`, as the setup cell does with `from datascience import *`:

In [None]:
# Lastly, we can import everything from math using the *
# Once again, we don't have to use 'math.' beforehand 
from math import *
log(pi)
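The earlier caution about reassigning imported names can be seen in a short sketch. The variable names are hypothetical, and the value `3` is a deliberately crude stand-in for $\pi$.

```python
import math
from math import pi

circumference = 2 * pi * 1.0         # uses the constant imported from math

pi = 3                               # reassigning the bare name hides the imported value
rough_circumference = 2 * pi * 1.0   # now silently uses 3 instead

# The module's own attribute is unaffected by the reassignment.
print(circumference, rough_circumference, math.pi)
```

Nothing warns you when a name is reassigned this way, which is one reason some programmers prefer the `<module name>.<name>` form.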

Don't worry too much about which type of import to use. It's often a coding style choice left up to each programmer. In this course, you'll always import the necessary modules when you run the setup cell (like the first code cell in this lab).

Let's move on to practicing some of the table operations you've learned in lecture!

## 3. Table operations

The file `trees.csv` contains data on trees. All trees are eatable, but this data shows which ones are edible.
Run the next cell to load the `trees` table.

In [None]:
# Just run this cell

trees = Table.read_table('trees.csv')

Let's examine our table to see what data it contains.

<span style="color:blue">**Question 3.0.1.**</span> Use the method `show` to display the first 5 rows of `trees`. 

*Reminder:* The terms "method" and "function" are technically not the same thing, but for the purposes of this course, we will often use them interchangeably.

**Hint:** `tbl.show(3)` will show the first 3 rows of the table `tbl`. Additionally, make sure not to call `.show()` without an argument, as this will crash your kernel!


In [None]:
...

Notice that some of the values in this table are missing, as denoted by "nan." This means either that the value is not available or not applicable. You'll also notice that the table has a large number of columns in it!

### `num_columns`

The table property `num_columns` returns the number of columns in a table. (A "property" is just a method that doesn't need to be called by adding parentheses.)

Example call: `<tbl>.num_columns`

<span style="color:blue">**Question 3.0.2**</span> Use `num_columns` to find the number of columns in our trees dataset.

Assign the number of columns to `num_trees_columns`.

<!--
BEGIN QUESTION
name: q32
-->

In [None]:
num_trees_columns = ...
print("The table has", num_trees_columns, "columns in it!")

In [None]:
grader.check("q32")

### `num_rows`

Similarly, the property `num_rows` tells you how many rows are in a table.

In [None]:
# Just run this cell

num_trees_rows = trees.num_rows
print("The table has", num_trees_rows, "rows in it!")

### `select`

Most of the columns are about location -- which neighbourhood, longitude and latitude etc.  If we're not interested in that information, it just makes the table difficult to read.  This comes up more than you might think, because people who collect and publish data may not know ahead of time what people will want to do with it.

In such situations, we can use the table method `select` to choose only the columns that we want in a particular table. It takes any number of arguments. Each should be the name of a column in the table. It returns a new table with only those columns in it. The columns are in the order *in which they were listed as arguments*.

For example, the value of `trees.select("SPECIES_BOTANICAL","DIAMETER_BREAST_HEIGHT")` is a table with only the type of tree and measurement.



<span style="color:blue">**Question 3.0.3**</span> Use `select` to create a table with only the common name, latitude, and longitude of each tree.  Call that new table `trees_locations`. (Use separate longitude and latitude columns)

*Hint:* Make sure to be exact when using column names with `select`; double-check capitalization!

<!--
BEGIN QUESTION
name: q33
-->

In [None]:
trees_locations = ...
trees_locations

In [None]:
grader.check("q33")

### `drop`

`drop` serves the same purpose as `select`, but it takes away the columns that you provide rather than the ones that you don't provide. Like `select`, `drop` returns a new table.

<span style="color:blue">**Question 3.0.4**</span> Suppose you just didn't want the `ID` and `CULTIVAR` columns in `trees`.  Create a table that's a copy of `trees` but doesn't include those columns.  Call that table `trees_without_id`.

<!--
BEGIN QUESTION
name: q34
-->

In [None]:
trees_without_id = ...
trees_without_id

In [None]:
grader.check("q34")

Now, suppose we want to answer some questions about when trees were planted. Let's get all the most recent plantings.

To answer this, we'll sort `trees` by when they were planted!

In [None]:
trees.sort('PLANTED_DATE')

Oops, that didn't answer our question because we sorted from oldest to newest. We'll have to sort in reverse order.

If you are having trouble finding the column of interest in your result, you can do a `select` of the planted date.

In [None]:
trees.sort('PLANTED_DATE', descending=True)

(The `descending=True` bit is called an *optional argument*. It has a default value of `False`, so when you explicitly tell the function `descending=True`, then the function will sort in descending order.)
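Optional arguments are not specific to tables. As an analogy only, Python's built-in `sorted` function has an optional `reverse` argument that also defaults to `False`; the list `scores` below is made-up data.

```python
scores = [72, 95, 88]   # hypothetical values

ascending = sorted(scores)                  # reverse is left at its default, False
descending = sorted(scores, reverse=True)   # explicitly ask for the opposite order

print(ascending)
print(descending)
print(scores)   # sorted returns a new list; the original order is untouched
```

Leaving the optional argument out gives the default behaviour, and naming it in the call overrides that default.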

### `sort`

Some details about sort:

1. The first argument to `sort` is the name of a column to sort by.
2. If the column has text in it, `sort` will sort alphabetically; if the column has numbers, it will sort numerically.
3. The value of `trees.sort("PLANTED_DATE")` is a *copy* of `trees`; the `trees` table doesn't get modified. For example, if we called `trees.sort("PLANTED_DATE")`, then running `trees` by itself would still return the unsorted table.
4. Rows always stick together when a table is sorted.  It wouldn't make sense to sort just one column and leave the other columns alone.  For example, in this case, if we sorted just the `PLANTED_DATE` column, the trees would all end up with the wrong dates.


Now let's say we want a table of all trees in the Donsdale neighbourhood. Sorting won't help us much here.

Instead, we use the table method `where`.

In [None]:
donsdale_trees = trees.where('NEIGHBOURHOOD_NAME', are.equal_to('DONSDALE'))
donsdale_trees

Ignore the syntax for the moment.  Instead, try to read that line like this:

> Assign the name **`donsdale_trees`** to a table whose rows are the rows in the **`trees`** table **`where`** the **`'NEIGHBOURHOOD_NAME'`**s **`are` `equal` `to` `DONSDALE`**.

### `where`

Now let's dive into the details a bit more.  `where` takes 2 arguments:

1. The name of a column.  `where` finds rows where that column's values meet some criterion.
2. A predicate that describes the criterion that the column needs to meet.

The predicate in the example above called the function `are.equal_to` with the value we wanted, 'DONSDALE'.  We'll see other predicates soon.

`where` returns a table that's a copy of the original table, but **with only the rows that meet the given predicate**.

<span style="color:blue">**Question 3.0.6**</span> Use `donsdale_trees` to create a table called `donsdale_boulevard` containing the trees on boulevards in Donsdale.
<!--
BEGIN QUESTION
name: q36
-->

In [None]:
donsdale_boulevard = ...
donsdale_boulevard

In [None]:
grader.check("q36")

You may have noticed that `donsdale_trees` has almost the same number of rows as `donsdale_boulevard`. What were the other location types? One of the other predicates listed below makes this easy to find out.

So far we've only used `where` with a predicate that requires the values in a column to be *exactly* equal to a certain value. However, there are many other predicates. Here are a few:

|Predicate|Example|Result|
|-|-|-|
|`are.equal_to`|`are.equal_to(50)`|Find rows with values equal to 50|
|`are.not_equal_to`|`are.not_equal_to(50)`|Find rows with values not equal to 50|
|`are.above`|`are.above(50)`|Find rows with values above (and not equal to) 50|
|`are.above_or_equal_to`|`are.above_or_equal_to(50)`|Find rows with values above 50 or equal to 50|
|`are.below`|`are.below(50)`|Find rows with values below 50|
|`are.between`|`are.between(2, 10)`|Find rows with values above or equal to 2 and below 10|

## 4. Analyzing a dataset

Now that you're familiar with table operations, let’s answer an interesting question about a dataset!

Run the cell below to load the `imdb` table and print it out. It contains information about the 250 highest-rated movies on IMDb, the Internet Movie Database.

In [None]:
# Just run this cell

imdb = Table.read_table('imdb.csv')
imdb

Often, we want to perform multiple operations - sorting, filtering, or others - in order to turn a table we have into something more useful. You can do these operations one by one, e.g.

```
first_step = original_tbl.where("col1", are.equal_to(12))
second_step = first_step.sort("col2", descending=True)
```

However, since the value of the expression `original_tbl.where("col1", are.equal_to(12))` is itself a table, you can just call a table method on it:

```
original_tbl.where("col1", are.equal_to(12)).sort("col2", descending=True)
```
You should organize your work in the way that makes the most sense to you, using informative names for any intermediate tables you create. 

<span style="color:blue">**Question 4.0.1**</span> Create a table of movies released between 2010 and 2016 (inclusive) with ratings above 8. The table should only contain the columns `Title` and `Rating`, **in that order**.

Assign the table to the name `above_eight`.

*Hint:* Think about the steps you need to take, and try to put them in an order that makes sense. Feel free to create intermediate tables for each step, but please make sure you assign your final table the name `above_eight`!

<!--
BEGIN QUESTION
name: q41
-->

In [None]:
above_eight = ...
above_eight

In [None]:
grader.check("q41")

<span style="color:blue">**Question 4.0.2**</span> Use `num_rows` (and arithmetic) to find the *proportion* of movies in the dataset that were released 1900-1999, and the *proportion* of movies in the dataset that were released in the year 2000 or later.

Assign `proportion_in_20th_century` to the proportion of movies in the dataset that were released 1900-1999, and `proportion_in_21st_century` to the proportion of movies in the dataset that were released in the year 2000 or later.

*Hint:* The *proportion* of movies released in the 1900's is the *number* of movies released in the 1900's, divided by the *total number* of movies.

<!--
BEGIN QUESTION
name: q42
-->

In [None]:
num_movies_in_dataset = ...
num_in_20th_century = ...
num_in_21st_century = ...
proportion_in_20th_century = ...
proportion_in_21st_century = ...
print("Proportion in 20th century:", proportion_in_20th_century)
print("Proportion in 21st century:", proportion_in_21st_century)

In [None]:
grader.check("q42")

## 5. Arrays

Computers are most useful when you can use a small amount of code to *do the same action* to *many different things*.

For example, in the time it takes you to calculate the 18% tip on a restaurant bill, a laptop can calculate 18% tips for every restaurant bill paid by every human on Earth that day.  (That's if you're pretty fast at doing arithmetic in your head!)

**Arrays** are how we put many values in one place so that we can operate on them as a group. For example, if `billions_of_numbers` is an array of numbers, the expression

    .18 * billions_of_numbers

gives a new array of numbers that contains the result of multiplying each number in `billions_of_numbers` by .18.  Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.

Concretely, an array is a **collection of values of the same type**. 
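One consequence of the same-type rule is that NumPy converts mixed values to a single common type. The sketch below uses `np.array` from NumPy directly (rather than the course's `make_array`) so that it is self-contained; the array names are illustrative.

```python
import numpy as np

whole_numbers = np.array([1, 2, 3])        # all integers
mixed_numbers = np.array([1, 2, 3.5])      # the integers are converted to floats
numbers_and_text = np.array([1, "two"])    # the number is converted to a string

# dtype reports the single type shared by every element.
print(whole_numbers.dtype, mixed_numbers.dtype, numbers_and_text.dtype)
print(numbers_and_text)
```

The conversion happens silently, so it is worth checking `dtype` when an array does not behave the way you expect.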

### 5.1. Making arrays

First, let's learn how to manually input values into an array. This typically isn't how programs work. Normally, we create arrays by loading them from an external source, like a data file.

But nonetheless, to create an array by hand, call the function `make_array`.  Each argument you pass to `make_array` will be in the array it returns.  Run this cell to see an example:

In [None]:
make_array(0.125, 4.75, -1.3)

Each value in an array (in the above case, the numbers 0.125, 4.75, and -1.3) is called an *element* of that array.

Arrays themselves are also values in Python, just like numbers and strings.  That means you can assign them to names or use them as arguments to functions. For example, `len(some_array)` returns the number of elements in `some_array`.

<span style="color:blue">**Question 5.1.1.**</span> Make an array containing the numbers 0, 1, -1, $\pi$, and $e$, in that order.  Name it `interesting_numbers`.  

*Hint:* How did you get the values $\pi$ and $e$ earlier up in this lab?  You can refer to them in exactly the same way here.

<!--
BEGIN QUESTION
name: q511
-->

In [None]:
interesting_numbers = ...
interesting_numbers

In [None]:
grader.check("q511")

<span style="color:blue">**Question 5.1.2.**</span> Make an array containing the five strings `"Hello"`, `","`, `" "`, `"world"`, and `"!"`.  (The third one is a single space inside quotes.)  Name it `hello_world_components`.

*Note:* If you evaluate `hello_world_components`, you'll notice some extra information in addition to its contents: `dtype='<U5'`.  That's just NumPy's extremely cryptic way of saying that the data types in the array are strings.

<!--
BEGIN QUESTION
name: q512
-->

In [None]:
hello_world_components = ...
hello_world_components

In [None]:
grader.check("q512")

###  `np.arange`
Arrays are provided by a package called [NumPy](http://www.numpy.org/) (pronounced "NUM-pie"). The package is called `numpy`, but it's standard to rename it `np` for brevity.  You can do that with:

    import numpy as np

You'll also notice we have this conveniently done for you already in the setup cell at the top of your assignment notebooks! 

Very often in data science, we want to work with many numbers that are evenly spaced within some range.  NumPy provides a special function for this called `arange`.  The line of code `np.arange(start, stop, step)` evaluates to an array with all the numbers starting at `start` and counting up by `step`, stopping **before** `stop` is reached.

Run the following cells to see some examples!

In [None]:
# This array starts at 1 and counts up by 2
# and then stops before 6
np.arange(1, 6, 2)

In [None]:
# This array doesn't contain 9
# because np.arange stops *before* the stop value is reached
np.arange(4, 9, 1)

<span style="color:blue">**Question 5.1.3.**</span> Import `numpy` as `np` and then use `np.arange` to create an array with the multiples of 99 from 0 up to (**and including**) 9999.  (So its elements are 0, 99, 198, 297, etc.)

<!--
BEGIN QUESTION
name: q513
-->

In [None]:
import numpy as np 
multiples_of_99 = ...
multiples_of_99

In [None]:
grader.check("q513")

### Temperature readings
There are multiple weather stations in Canada that measure surface temperatures at different sites around the country.  The hourly readings are [publicly available](https://climate.weather.gc.ca/historical_data/search_historic_data_e.html).

Suppose we download all the hourly data from the Edmonton International Airport station for the month of May 2021.  To analyze the data, we want to know when each reading was taken, but we find that the data don't include the timestamps of the readings (the time at which each one was taken).

However, we know the first reading was taken at the first instant of May 2021 (midnight on May 1st) and each subsequent reading was taken exactly 1 hour after the last.

<span style="color:blue">**Question 5.1.4.**</span> Create an array of the *time, in seconds, since the start of the month* at which each hourly reading was taken.  Name it `collection_times`.

*Hint 1:* There were 31 days in May, which is equivalent to ($31 \times 24$) hours or ($31 \times 24 \times 60 \times 60$) seconds.  So your array should have $31 \times 24$ elements in it.

*Hint 2:* The `len` function works on arrays, too!  If your `collection_times` isn't passing the tests, check its length and make sure it has $31 \times 24$ elements.

<!--
BEGIN QUESTION
name: q514
-->

In [None]:
collection_times = ...
collection_times

In [None]:
grader.check("q514")

### 5.2. Working with single elements of arrays ("indexing")
Let's work with a more interesting dataset.  The next cell creates an array called `population_amounts` that includes estimated world populations in every year from **1950** to roughly the present.  (The estimates come from the US Census Bureau website.)

Rather than type in the data manually, we've loaded them from a file on your computer called `world_population.csv`, using `Table.read_table` as we did for `trees` and `imdb` above, and then pulled out the `Population` column as an array.

In [None]:
population_amounts = Table.read_table("world_population.csv").column("Population")
population_amounts

Here's how we get the first element of `population_amounts`, which is the world population in the first year in the dataset, 1950.

In [None]:
population_amounts.item(0)

The value of that expression is the number 2557628654 (around 2.5 billion), because that's the first thing in the array `population_amounts`.

Notice that we wrote `.item(0)`, not `.item(1)`, to get the first element.  This is a weird convention in computer science.  0 is called the *index* of the first item.  It's the number of elements that appear *before* that item.  So 3 is the index of the 4th item.

Here are some more examples.  In the examples, we've given names to the things we get out of `population_amounts`.  Read and run each cell.

In [None]:
# The 13th element in the array is the population
# in 1962 (which is 1950 + 12).
population_1962 = population_amounts.item(12)
population_1962

In [None]:
# The 66th element is the population in 2015.
population_2015 = population_amounts.item(65)
population_2015

In [None]:
# The array has only 66 elements, so this doesn't work.
# (There's no element with 66 other elements before it.)
population_2016 = population_amounts.item(66)
population_2016
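The out-of-range lookup above stops with an `IndexError`. The sketch below reproduces that on a small made-up array, using `np.array` so it runs on its own; the names `small`, `last`, and `out_of_range_failed` are hypothetical.

```python
import numpy as np

small = np.array([10, 20, 30])      # 3 elements, so the valid indices are 0, 1, and 2

# The index of the last element is always one less than the length.
last = small.item(len(small) - 1)

try:
    small.item(3)                   # no element has 3 other elements before it
    out_of_range_failed = False
except IndexError as error:
    out_of_range_failed = True
    print("IndexError:", error)

print(last)
```

Using `len(some_array) - 1` as the last valid index is a reliable way to avoid this error.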

Since `make_array` returns an array, we can call `.item(3)` on its output to get its 4th element, just like we "chained" together calls to the method `replace` earlier.

In [None]:
make_array(-1, -3, 4, -2).item(3)

<span style="color:blue">**Question 5.2.1.**</span> Set `population_1973` to the world population in 1973, by getting the appropriate element from `population_amounts` using `item`.

<!--
BEGIN QUESTION
name: q521
-->

In [None]:
population_1973 = ...
population_1973

In [None]:
grader.check("q521")

### 5.3. Doing something to every element of an array
Arrays are primarily useful for doing the same operation many times, so we don't often have to use `.item` and work with single elements.

#### Logarithms
Here is one simple question we might ask about world population:

> How big was the population in *orders of magnitude* in each year?

Orders of magnitude quantify how big a number is by representing it as the power of another number (for example, representing 104 as $10^{2.017033}$). One way to do this is by using the logarithm function. The logarithm (base 10) of a number increases by 1 every time we multiply the number by 10. It's like a measure of how many decimal digits the number has, or how big it is in orders of magnitude.
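The claims in this paragraph can be checked numerically with the standard `math` module. This is a small sketch with illustrative variable names; the number 104 comes from the example above.

```python
import math

magnitude_104 = math.log10(104)      # the exponent x such that 10**x is 104
magnitude_1040 = math.log10(1040)    # the same number multiplied by 10

# Multiplying by 10 raises the base-10 logarithm by 1 (up to rounding error).
difference = magnitude_1040 - magnitude_104

# For a positive whole number, the digit count is the logarithm rounded down, plus 1.
digits_in_104 = math.floor(magnitude_104) + 1

print(magnitude_104, difference, digits_in_104)
```

The same check works for any positive number, which is why the logarithm is a convenient measure of size.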

We could try to answer our question like this, using the `log10` function from the `math` module and the `item` method you just saw:

In [None]:
population_1950_magnitude = math.log10(population_amounts.item(0))
population_1951_magnitude = math.log10(population_amounts.item(1))
population_1952_magnitude = math.log10(population_amounts.item(2))
population_1953_magnitude = math.log10(population_amounts.item(3))
...

But this is tedious and doesn't really take advantage of the fact that we are using a computer.

Instead, NumPy provides its own version of `log10` that takes the logarithm of each element of an array.  It takes a single array of numbers as its argument.  It returns an array of the same length, where the first element of the result is the logarithm of the first element of the argument, and so on.

<span style="color:blue">**Question 5.3.1.**</span> Use `np.log10` to compute the logarithms of the world population in every year.  Give the result (an array of 66 numbers) the name `population_magnitudes`.  Your code should be very short.

<!--
BEGIN QUESTION
name: q531
-->

In [None]:
population_magnitudes = ...
population_magnitudes

In [None]:
grader.check("q531")

What you just did is called *elementwise* application of `np.log10`, since `np.log10` operates separately on each element of the array that it's called on. Here's a picture of what's going on:

<img src="array_logarithm.jpg">


The textbook's [section](https://www.inferentialthinking.com/chapters/05/1/Arrays)  on arrays has a useful list of NumPy functions that are designed to work elementwise, like `np.log10`.
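As one more illustration of elementwise application, the sketch below uses `np.sqrt`, another NumPy function that works on every element of an array. The array `areas` is made-up data, and `np.array` is used in place of `make_array` so the example is self-contained.

```python
import numpy as np

areas = np.array([1.0, 4.0, 9.0, 16.0])   # hypothetical areas of four squares

# np.sqrt takes the square root of each element separately.
side_lengths = np.sqrt(areas)

print(side_lengths)
print(len(areas), len(side_lengths))       # the result has the same length as the input
```

The pattern is the same as with `np.log10`: one call, one array in, one array of the same length out.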

#### Arithmetic
Arithmetic also works elementwise on arrays, meaning that if you perform an arithmetic operation (like subtraction, division, etc) on an array, Python will do the operation to every element of the array individually and return an array of all of the results. For example, you can divide all the population numbers by 1 billion to get numbers in billions:

In [None]:
population_in_billions = population_amounts / 1000000000
population_in_billions

You can do the same with addition, subtraction, multiplication, and exponentiation (`**`). For example, you can calculate a tip on several restaurant bills at once (in this case just 3):

In [None]:
restaurant_bills = make_array(20.12, 39.90, 31.01)
print("Restaurant bills:\t", restaurant_bills)

# Array multiplication
tips = .2 * restaurant_bills
print("Tips:\t\t\t", tips)

<img src="array_multiplication.jpg">

<span style="color:blue">**Question 5.3.2.**</span> Suppose the total charge at a restaurant is the original bill plus the tip. If the tip is 20%, that means we can multiply the original bill by 1.2 to get the total charge.  Compute the total charge for each bill in `restaurant_bills`, and assign the resulting array to `total_charges`.

<!--
BEGIN QUESTION
name: q532
-->

In [None]:
total_charges = ...
total_charges

In [None]:
grader.check("q532")

<span style="color:blue">**Question 5.3.3.**</span> The array `more_restaurant_bills` contains 100,000 bills!  Compute the total charge for each one.  How is your code different?

<!--
BEGIN QUESTION
name: q533
-->

In [None]:
more_restaurant_bills = Table.read_table("more_restaurant_bills.csv").column("Bill")
more_total_charges = ...
more_total_charges

In [None]:
grader.check("q533")

The function `sum` takes a single array of numbers as its argument.  It returns the sum of all the numbers in that array (so it returns a single number, not an array).
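Here is a minimal sketch of `sum` on a tiny made-up array. It uses `np.array` rather than `make_array` so that it runs without the course package; `values` and `total` are hypothetical names.

```python
import numpy as np

values = np.array([1.5, 2.5, 3.0])   # hypothetical numbers

# sum collapses the whole array into a single number.
total = sum(values)

print(total)
```

Unlike elementwise functions such as `np.log10`, the result here is one number rather than an array.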

<span style="color:blue">**Question 5.3.4.**</span> What was the sum of all the bills in `more_restaurant_bills`, *including tips*?

<!--
BEGIN QUESTION
name: q534
-->

In [None]:
sum_of_bills = ...
sum_of_bills

In [None]:
grader.check("q534")

<span style="color:blue">**Question 5.3.5.**</span> The powers of 2 ($2^0 = 1$, $2^1 = 2$, $2^2 = 4$, etc.) arise frequently in computer science.  (For example, you may have noticed that storage on smartphones or USB flash drives comes in powers of 2, like 16 GB, 32 GB, or 64 GB.)  Use `np.arange` and the exponentiation operator `**` to compute the first 30 powers of 2, starting from $2^0$.

*Hint 1:* `np.arange(1, 2**30, 1)` creates an array with $2^{30} - 1$ elements (over a billion) and **will crash your kernel**. That's a lot of elements!

*Hint 2:* Part of your solution will involve `np.arange`, but your array shouldn't have more than 30 elements.

<!--
BEGIN QUESTION
name: q535
-->

In [None]:
powers_of_2 = ...
powers_of_2

In [None]:
grader.check("q535")

## 6. Creating Tables

An array is useful for describing a single attribute of each element in a collection. For example, let's say our collection is all US States. Then an array could describe the land area of each state. 

Tables extend this idea by containing multiple arrays, each one describing a different attribute for every element of a collection. In this way, tables allow us to not only store data about many entities but to also contain several kinds of data about each entity.

For example, in the cell below we have two arrays. The first one, `population_amounts`, was defined above in section 5.2 and contains the world population in each year (estimated by the US Census Bureau). The second array, `years`, contains the years themselves. These elements are in order, so the year and the world population for that year have the same index in their corresponding arrays.

In [None]:
# Just run this cell

years = np.arange(1950, 2015+1)
print("Population column:", population_amounts)
print("Years column:", years)
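To make the "same index" idea concrete, here is a small sketch with two parallel arrays. The names `rain_years` and `rainfall_mm` and their values are hypothetical and have nothing to do with the population data.

```python
import numpy as np

# Hypothetical data: yearly rainfall (mm) for four consecutive years.
rain_years = np.arange(2001, 2005)            # 2001, 2002, 2003, 2004
rainfall_mm = np.array([410, 385, 520, 467])

# The same position in both arrays describes the same year.
position = 2
print(rain_years[position], rainfall_mm[position])
```

This only works because both arrays have the same length and are kept in the same order. Nothing in NumPy enforces that link between them.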

Suppose we want to answer this question:

> In which year did the world's population cross 6 billion?

You could technically answer this question just from staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 6 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a *`Table`*, a 2-dimensional type of dataset. 

The expression below:

- creates an empty table using the expression `Table()`,
- adds two columns by calling `with_columns` with four arguments,
- assigns the result to the name `population`, and finally
- evaluates `population` so that we can see the table.

The strings `"Year"` and `"Population"` are column labels that we have chosen. The names `population_amounts` and `years` were assigned above to two arrays of the **same length**. The function `with_columns` (you can find the documentation [here](http://data8.org/datascience/tables.html)) takes in alternating strings (to represent column labels) and arrays (representing the data in those columns). The strings and arrays are separated by commas.

In [None]:
population = Table().with_columns(
    "Population", population_amounts,
    "Year", years
)
population

Now the data is combined into a single table! It's much easier to parse this data. If you need to know what the population was in 1959, for example, you can tell from a single glance.

<span style="color:blue">**Question 6.0.1**</span> In the cell below, we've created 2 arrays. Using the steps above, assign `top_10_movies` to a table that has two columns called "Rating" and "Name", which hold `top_10_movie_ratings` and `top_10_movie_names` respectively.

<!--
BEGIN QUESTION
name: q61
-->

In [None]:
top_10_movie_ratings = make_array(9.2, 9.2, 9., 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8)
top_10_movie_names = make_array(
        'The Shawshank Redemption (1994)',
        'The Godfather (1972)',
        'The Godfather: Part II (1974)',
        'Pulp Fiction (1994)',
        "Schindler's List (1993)",
        'The Lord of the Rings: The Return of the King (2003)',
        '12 Angry Men (1957)',
        'The Dark Knight (2008)',
        'Il buono, il brutto, il cattivo (1966)',
        'The Lord of the Rings: The Fellowship of the Ring (2001)')

top_10_movies = ...

# We've put this next line here 
# so your table will get printed out 
# when you run this cell.
top_10_movies

In [None]:
grader.check("q61")

#### Loading a table from a file

In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we load it in from an external source, like a data file. There are many formats for data files, but CSV ("comma-separated values") is one of the most common. 

`Table.read_table(...)` takes one argument (a path to a data file in string format) and returns a table.  

*Hint: You might be super familiar with this method from the demo video Intro to Tables!*

<span style="color:blue">**Question 6.0.2**</span> Remember from earlier in the lab: `imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb.  Load it as a table called `imdb`. We've actually done this earlier in a cell above!

<!--
BEGIN QUESTION
name: q62
-->

In [None]:
imdb = ...
imdb

In [None]:
grader.check("q62")

Where did `imdb.csv` come from? Take a look at [this lab's folder](./). You should see a file called `imdb.csv`.

Open up the `imdb.csv` file in that folder and look at the format. What do you notice? The `.csv` filename ending says that this file is in the [CSV (comma-separated value) format](http://edoceo.com/utilitas/csv-file-format).
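If you'd like to see the format without opening a file, the sketch below parses a tiny piece of CSV text with Python's built-in `csv` module. The text in `csv_text` is made up for illustration; the real `imdb.csv` has its own columns and rows.

```python
import csv
import io

# Hypothetical CSV text: one header line, then one line per row.
csv_text = "Title,Year,Score\nFirst Film,1999,8.1\nSecond Film,2004,7.6\n"

# csv.reader splits each line into a list of strings at the commas.
rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]

print(header)
for record in records:
    print(record)
```

Every value comes back as a string here. Converting columns such as years into numbers is a separate step, which table-loading functions typically handle for you.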

## 7. More table operations!

Now that you've worked with arrays, let's add a few more methods to our list of table operations.

### `column`

`column` takes the column name of a table (in string format) as its argument and returns the values in that column as an **array**. 

In [None]:
# Returns an array of movie names
top_10_movies.column('Name')

### `take`
The table method `take` takes as its argument an array of numbers.  Each number should be the index of a row in the table.  It returns a **new table** with only those rows. 

You'll usually want to use `take` in conjunction with `np.arange` to take the first few rows of a table.
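Since `take` expects an array of row indices, it helps to see what `np.arange` produces. This is a small sketch using NumPy only; the `every_other` variation is an illustrative assumption, not something the lab requires.

```python
import numpy as np

# Indices of the first five rows: start at 0, stop before 5, step by 1.
first_five = np.arange(0, 5, 1)
print(first_five)

# Hypothetical variation: a step of 2 picks every other index below 10.
every_other = np.arange(0, 10, 2)
print(every_other)
```

Note that the stop value is excluded, so `np.arange(0, 5, 1)` ends at index 4.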

In [None]:
# Take first 5 movies of top_10_movies
top_10_movies.take(np.arange(0, 5, 1))

The next three questions will give you practice with combining the operations you've learned in this lab and the previous one to answer questions about the `population` and `imdb` tables. First, check out the `population` table from section 6.

In [None]:
# Run this cell to display the population table.
population

<span style="color:blue">**Question 7.0.1**</span> Check out the `population` table from section 6 of this lab.  Compute the year when the world population first went above 6 billion. Assign the year to `year_population_crossed_6_billion`.

<!--
BEGIN QUESTION
name: q71
-->

In [None]:
year_population_crossed_6_billion = ...
year_population_crossed_6_billion

In [None]:
grader.check("q71")

<span style="color:blue">**Question 7.0.2**</span> Find the average rating for movies released before the year 2000 and the average rating for movies released in the year 2000 or after for the movies in `imdb`.

*Hint*: Think of the steps you need to do (take the average, find the ratings, find movies released before 2000 or in 2000 and later), and try to put them in an order that makes sense.

<!--
BEGIN QUESTION
name: q72
-->

In [None]:
before_2000 = ...
after_or_in_2000 = ...
print("Average before 2000 rating:", before_2000)
print("Average after or in 2000 rating:", after_or_in_2000)

In [None]:
grader.check("q72")

<span style="color:blue">**Question 7.0.3**</span> Here's a challenge: Find the number of movies that came out in *even* years.

*Hint:* The operator `%` computes the remainder when dividing by a number.  So `5 % 2` is 1 and `6 % 2` is 0.  A number is even if the remainder is 0 when you divide by 2.

*Hint 2:* `%` can be used on arrays, operating elementwise like `+` or `*`.  So `make_array(5, 6, 7) % 2` is `array([1, 0, 1])`.

*Hint 3:* Create a column called "Year Remainder" that's the remainder when each movie's release year is divided by 2.  Make a copy of `imdb` that includes that column (`imdb.with_column(...)` returns a new table).  Then use `where` to find rows where that new column is equal to 0.  Then use `num_rows` to count the number of such rows.

*Note:* These steps can be chained in one single statement, or broken up across several lines with intermediate names assigned. You’re always welcome to break down problems however you wish!

<!--
BEGIN QUESTION
name: q73
-->

In [None]:
num_even_year_movies = ...
num_even_year_movies

In [None]:
grader.check("q73")

## 8. Summary

For your reference, here's a table of all the functions and methods we saw in this lab. We'll learn more methods to add to this table in the coming week!

|Name|Example|Purpose|
|-|-|-|
|`sort`|`tbl.sort("N")`|Create a copy of a table sorted by the values in a column|
|`where`|`tbl.where("N", are.above(2))`|Create a copy of a table with only the rows that match some *predicate*|
|`num_rows`|`tbl.num_rows`|Compute the number of rows in a table|
|`num_columns`|`tbl.num_columns`|Compute the number of columns in a table|
|`select`|`tbl.select("N")`|Create a copy of a table with only some of the columns|
|`drop`|`tbl.drop("N")`|Create a copy of a table without some of the columns|
|`column`|`tbl.column("N")`|Create an array made up of the values of a given column in a table|
|`take`|`tbl.take(np.arange(0, 5))`|Create a copy of a table with only the rows at the given indices|

<br/>

Alright! You're finished with lab 2!  Be sure to...
- run all the tests (the next cell has a shortcut for that),
- choose **Save and Checkpoint** from the `File` menu, and
- **run the last cell to submit your work**.

This lab is altered from the original [Berkeley data-8 course](http://data8.org/), which is licensed under the [Creative Commons license](https://creativecommons.org/licenses/by-nc/4.0/).

We got our `trees.csv` dataset from [the open Edmonton portal](https://data.edmonton.ca/)

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Be sure to run the tests and verify that they all pass. **Just because the tests pass does not mean your answer is right.** The tests will sometimes only give a hint or notify you if you missed a question. Then choose **Save and Checkpoint** from the **File** menu and run the final cell to create the .zip file.
**The .zip file is what you will hand in on eclass.**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()