# Arrays
Welcome to lab 5!

Last time we saw a new *data type*: the string, used to represent text data.  We also saw that strings had functions associated with them (*methods*) that could be accessed with dots.

Until today, we haven't done much that you couldn't do yourself by hand, without going through the trouble of learning Python.  Computers are most useful when you can use a small amount of code to *do the same action* to *many different things*.

Today, you'll see another new data type: the *array*.  Arrays will let us work with datasets in Python -- *collections* of data, like the numbers 2, 3, 4, and 5, or the words "welcome", "to", and "lab".  This will enable you to do things you couldn't possibly do without a computer.

Initialize the OK tests to get started.

In [None]:
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from lab_functions import *

from client.api.notebook import Notebook
ok = Notebook('lab05.ok')
_ = ok.auth(inline=True)

**Submission**: Once you're finished, select "Save and Checkpoint" in the File menu and then execute the submit cell below (or at the end). The result will contain a link that you can use to check that your assignment has been submitted successfully. 

In [None]:
_ = ok.submit()

# 1. Arrays
In the time it takes you to calculate the 18% tip on a restaurant bill, a typical laptop can calculate 18% tips for every restaurant bill paid by every human on Earth that day.  (That's if you're pretty fast at doing arithmetic in your head!)

**Arrays** are how we put many values in one place so that we can operate on them as a group. For example, if `billions_of_numbers` is an array of numbers, the expression

    .18 * billions_of_numbers

gives a new array of numbers that's the result of multiplying each number in `billions_of_numbers` by .18 (18%).  Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.
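
As a small illustration of the expression above, here is a sketch that multiplies a whole array by .18 in one step. It builds the array with NumPy's `np.array` (NumPy is introduced later in this lab); the name `bills` and its values are made up for the example.

```python
import numpy as np

# Hypothetical restaurant bills, in dollars (made-up values).
bills = np.array([20.0, 45.5, 100.0])

# One multiplication computes the 18% tip for every bill at once.
tips = .18 * bills
print(tips)
```

The result has one tip per bill, in the same order as the original array.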

Concretely, an array is a **collection of values of the same type**, like a column in an Excel spreadsheet.

<img src="excel_array.jpg">

Though it contains other values, an array is also a value, just like a `str` or a `float`.

## 1.1. Making arrays - the hard way
You can type in the data that goes in an array yourself, but that's not typically how programs work. Normally, we create arrays by loading them from an external source, like a data file.

For now, let's learn how to do it the hard way. Execute the following cell so that all the names from the `datascience` module are available to you.

In [None]:
from datascience import *

Now, to create an array, call the function `make_array`.  Each argument you pass to `make_array` will be in the array it returns.  Run this cell to see an example:

In [None]:
make_array(0.125, 4.75, -1.3)

Each value in an array (in the above case, the numbers 0.125, 4.75, and -1.3) is called an *element* of that array.

Arrays themselves are also values, just like numbers and strings.  That means you can assign them names or use them as arguments to functions.
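
To make that concrete, the sketch below gives an array a name and then passes it to two built-in functions. It uses `np.array` as a stand-in for `make_array` so that it runs without the `datascience` module; that substitution, and the name `measurements`, are assumptions of this example.

```python
import numpy as np

# Stand-in for make_array(0.125, 4.75, -1.3); np.array is an assumption here.
measurements = np.array([0.125, 4.75, -1.3])

# An array can be passed to a function like any other value.
number_of_elements = len(measurements)
largest = max(measurements)
print(number_of_elements, largest)
```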

**Question 1.1.1.** Make an array containing the numbers 1, 2, and 3, in that order.  Name it `small_numbers`.

In [None]:
small_numbers = make_array(1, 2, 3) #SOLUTION
small_numbers

In [None]:
_ = ok.grade('q111')

**Question 1.1.2.** Make an array containing the numbers 0, 1, -1, $\pi$, and $e$, in that order.  Name it `interesting_numbers`.  *Hint:* How did you get the values $\pi$ and $e$ earlier?  You can refer to them in exactly the same way here.  You might have to `import` something...

In [None]:
import math #SOLUTION
interesting_numbers = make_array(0, 1, -1, math.pi, math.e) #SOLUTION
interesting_numbers

In [None]:
_ = ok.grade('q112')

**Question 1.1.3.** Make an array containing the five strings `"Hello"`, `","`, `" "`, `"world"`, and `"!"`.  (The third one is a single space inside quotes.)  Name it `hello_world_components`.

*Note:* If you print `hello_world_components`, you'll notice some extra information in addition to its contents: `dtype='<U5'`.  That's just NumPy's extremely cryptic way of saying that the things in the array are strings (Unicode text) of at most 5 characters each.
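
If you're curious where the 5 comes from, this sketch shows the number tracking the longest string in the array. The arrays are built with `np.array` and their names are invented for the example.

```python
import numpy as np

# The longest string here has 5 characters, so NumPy reports a width of 5.
short_words = np.array(["Hello", "world", "!"])
print(short_words.dtype)

# A longer string widens the type so that every element fits.
longer_words = np.array(["Hello", "wonderful"])
print(longer_words.dtype)
```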

In [None]:
hello_world_components = make_array("Hello", ",", " ", "world", "!") #SOLUTION
hello_world_components

In [None]:
_ = ok.grade('q113')

The `join` method of a string takes an array of strings as its argument and puts all of the elements together into one string. Try it:

In [None]:
'°'.join(make_array('(╯', '□','）╯︵ ┻━┻'))
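
Here is a plainer sketch of the same idea, using an array built with `np.array` (the name `words` is made up for the example). The string that `join` is called on ends up between each pair of neighboring elements.

```python
import numpy as np

words = np.array(["to", "be", "or", "not"])

# The string before .join is placed between consecutive elements.
spaced = " ".join(words)
dashed = "-".join(words)
print(spaced)   # to be or not
print(dashed)   # to-be-or-not
```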

**Question 1.1.4.** Assign `separator` to a string so that the name `hello` is bound to the string `'Hello, world!'` in the cell below.

In [None]:
separator = '' # SOLUTION
hello = separator.join(hello_world_components)
hello

In [None]:
_ = ok.grade('q114')

### 1.1.1.  `np.arange`
Arrays are provided by a package called [NumPy](http://www.numpy.org/) (pronounced "NUM-pie" or, if you prefer to pronounce things incorrectly, "NUM-pee").  The package is called `numpy`, but it's standard to rename it `np` for brevity.  You can do that with:

    import numpy as np

Very often in data science, we want to work with many numbers that are evenly spaced within some range.  NumPy provides a special function for this called `arange`.  `np.arange(start, stop, space)` produces an array with all the numbers starting at `start` and counting up by `space`, stopping before `stop` is reached.  (NumPy's own documentation calls the third argument `step`.)

For example, the value of `np.arange(1, 8, 2)` is an array with elements 1, 3, 5, and 7 -- it starts at 1 and counts up by 2, then stops before 8.  In other words, it's equivalent to `make_array(1, 3, 5, 7)`.

`np.arange(4, 9, 1)` is an array with elements 4, 5, 6, 7, and 8.  (It doesn't contain 9 because `np.arange` stops *before* the stop value is reached.)
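
The two examples above can be checked directly. The third line of this sketch shows one way to include the endpoint: make the stop value one step larger. The variable names are made up for the example.

```python
import numpy as np

odds = np.arange(1, 8, 2)            # elements 1, 3, 5, 7
four_to_eight = np.arange(4, 9, 1)   # elements 4, 5, 6, 7, 8

# Making the stop value one step larger includes the old stop value.
evens_through_ten = np.arange(0, 10 + 2, 2)   # elements 0, 2, 4, 6, 8, 10
print(odds, four_to_eight, evens_through_ten)
```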

**Question 1.1.1.1.** Import `numpy` as `np` and then use `np.arange` to create an array with the multiples of 99 from 0 up to (**and including**) 9999.  (So its elements are 0, 99, 198, 297, etc.)

In [None]:
import numpy as np # SOLUTION
multiples_of_99 = np.arange(0, 9999+99, 99) # SOLUTION
multiples_of_99

In [None]:
_ = ok.grade('q1111')

##### Temperature readings
NOAA (the US National Oceanic and Atmospheric Administration) operates weather stations that measure surface temperatures at different sites around the United States.  The hourly readings are [publicly available](http://www.ncdc.noaa.gov/qclcd/QCLCD?prior=N).

Suppose we download all the hourly data from the Oakland, California site for the month of December 2015.  To analyze the data, we want to know when each reading was taken, but we find that the data don't include the timestamps of the readings (the time at which each one was taken).

However, we know the first reading was taken at the first instant of December 2015 (midnight on December 1st) and each subsequent reading was taken exactly 1 hour after the last.

**Question 1.1.1.2.** Create an array of the *time, in seconds, since the start of the month* at which each hourly reading was taken.  Name it `collection_times`.

*Hint 1:* There were 31 days in December, which is equivalent to ($31 \times 24$) hours or ($31 \times 24 \times 60 \times 60$) seconds.  So your array should have $31 \times 24$ elements in it.

*Hint 2:* The `len` function works on arrays, too.  If your `collection_times` isn't passing the tests, check its length and make sure it has $31 \times 24$ elements.
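
As a sanity check of the kind Hint 2 describes, here is a sketch for a smaller, hypothetical case: hourly readings for a single day instead of a whole month. The names `seconds_per_hour` and `one_day_times` are made up for the example.

```python
import numpy as np

# Hypothetical smaller case: one reading every hour for a single day,
# expressed as seconds since midnight.
seconds_per_hour = 60 * 60
one_day_times = np.arange(0, 24 * seconds_per_hour, seconds_per_hour)

# One day has 24 hourly readings, so the array should have 24 elements.
print(len(one_day_times))   # 24
```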

In [None]:
collection_times = np.arange(0, 31*24*60*60, 60*60) # SOLUTION
collection_times

In [None]:
_ = ok.grade('q1112')

## 1.2. Working with single elements of arrays ("indexing")
Let's work with a more interesting dataset.  The next cell creates an array called `population` that includes estimated world populations in every year from **1950** to 2015.  (The estimates come from the [US Census Bureau website](http://www.census.gov/population/international/data/worldpop/table_population.php).)

Rather than type in the data manually, we've loaded them from a file on your computer called `world_population.csv`.  You'll learn how to do that next week.

In [None]:
# Don't worry too much about what goes on in this cell.
from datascience import *
population = Table.read_table("world_population.csv").column("Population")
population

Here's how we get the first element of `population`, which is the world population in the first year in the dataset, 1950.

In [None]:
population.item(0)

The value of that expression is the number 2557628654 (around 2.5 billion), because that's the first thing in the array `population`.

Notice that we wrote `.item(0)`, not `.item(1)`, to get the first element.  This is a weird convention in computer science.  0 is called the *index* of the first item.  It's the number of elements that appear *before* that item.  So 3 is the index of the 4th item.
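
The counting rule is easy to try on a tiny array. This sketch uses a made-up four-element array named `letters`; it also shows what happens when the index is too large.

```python
import numpy as np

letters = np.array(["a", "b", "c", "d"])

first = letters.item(0)    # index 0: no elements come before it
fourth = letters.item(3)   # index 3: three elements come before it
print(first, fourth)       # a d

# Asking for an index past the end raises an IndexError.
try:
    letters.item(4)
except IndexError as error:
    print("IndexError:", error)
```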

Here are some more examples.  In the examples, we've given names to the things we get out of `population`.  Read and run each cell.

In [None]:
# The third element in the array is the population
# in 1952.
population_1952 = population.item(2)
population_1952

In [None]:
# The thirteenth element in the array is the population
# in 1962 (which is 1950 + 12).
population_1962 = population.item(12)
population_1962

In [None]:
# The 66th element is the population in 2015.
population_2015 = population.item(65)
population_2015

In [None]:
# The array has only 66 elements, so this doesn't work.
# (There's no element with 66 other elements before it.)
population_2016 = population.item(66)
population_2016

In [None]:
# Since make_array returns an array, we can call .item(3)
# on its output to get its 4th element, just like we
# "chained" together calls to the method "replace" earlier.
make_array(-1, -3, 4, -2).item(3)

**Question 1.2.1.** Set `population_1973` to the world population in 1973, by getting the appropriate element from `population` using `item`.

In [None]:
population_1973 = population.item(23) #SOLUTION
population_1973

In [None]:
_ = ok.grade('q121')

## 1.3. Working with the whole array
NumPy provides many functions that operate on arrays.  Here are two:

* `np.mean`: The average value of the numbers in an array.  More precisely: Takes a single array as its argument.  Returns a single floating-point number, the average of the numbers in the argument array.
* `np.diff`: Finds the increase between successive elements of an array.  More precisely: Takes a single array as its argument.  Returns an array that's one shorter.  Each element of the returned array is an element of the argument array subtracted from the element that follows it.
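
Both functions can be seen in action on a small array. The array `values` below is made up for the example.

```python
import numpy as np

values = np.array([1, 4, 9, 16])

average = np.mean(values)   # (1 + 4 + 9 + 16) / 4
steps = np.diff(values)     # 4 - 1, 9 - 4, 16 - 9
print(average)              # 7.5
print(steps)                # elements 3, 5, 7
```

Notice that `steps` has three elements while `values` has four.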

**Question 1.3.1.** Compute the average world population over the years 1950 to 2015.

In [None]:
average_world_population = np.mean(population) #SOLUTION
average_world_population

In [None]:
_ = ok.grade('q131')

**Question 1.3.2.** Compute the increase in population in each year from 1950 to 2014.  (Your answer should be an array, and its first element, for example, should be the increase in population from 1950 to 1951.)

In [None]:
population_changes = np.diff(population) #SOLUTION
population_changes

In [None]:
_ = ok.grade('q132')

**Question 1.3.3.** Compute the average change in population over the years from 1950 to 2014.

*Note:* There are several ways to do this.  If you have time, try to figure out two ways.
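
One reason two approaches can agree: successive differences add up to the last element minus the first, so their average equals that total change divided by the number of differences. The sketch below checks this on a small made-up array; it is an illustration, not necessarily the approach the tests expect.

```python
import math
import numpy as np

values = np.array([2, 5, 6, 10])

# Way 1: average the successive differences directly.
mean_of_differences = np.mean(np.diff(values))

# Way 2: the differences add up to (last - first), so divide that
# total change by the number of differences.
total_change = values.item(len(values) - 1) - values.item(0)
change_per_step = total_change / (len(values) - 1)

print(mean_of_differences, change_per_step)
```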

In [None]:
average_population_change = np.mean(np.diff(population)) #SOLUTION
average_population_change

In [None]:
_ = ok.grade('q133')

## 1.4. Working with each element of the array

Here's a brief preview of next lab.  If you divide an array by a number, you'll get a new array of the same length, with each element divided by the number.  The same is true for multiplication, addition, subtraction, and exponentiation (`**`).
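
Element-by-element arithmetic can be tried on a small array. The array `numbers` and the other names in this sketch are made up for the example.

```python
import numpy as np

numbers = np.array([1, 2, 3, 4])

plus_ten = numbers + 10     # elements 11, 12, 13, 14
minus_one = numbers - 1     # elements 0, 1, 2, 3
tripled = numbers * 3       # elements 3, 6, 9, 12
squared = numbers ** 2      # elements 1, 4, 9, 16
halved = numbers / 2        # elements 0.5, 1.0, 1.5, 2.0

# The original array is unchanged; each operation made a new array.
print(numbers, halved)
```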

**Question 1.4.1.** The population numbers are hard to read.  Try to make them easier by expressing them in *billions of people*.  That is, create an array called `population_in_billions` that looks like `population`, but with each element being the population *in billions* in the corresponding year.

In [None]:
population_in_billions = population / 1000000000 #SOLUTION
population_in_billions

In [None]:
_ = ok.grade('q141')

**Question 1.4.2.** That's still hard to read.  Look up the documentation for the function `np.round` by typing `np.round?` in the following cell and running the cell.  Then, use `np.round` to round the numbers in `population_in_billions` to 2 decimal places.  Call the resulting array `population_in_billions_rounded`.
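
For reference, here is a sketch of `np.round` on a small made-up array. The second argument is the number of decimal places to keep.

```python
import numpy as np

# Made-up values with more digits than we want to read.
long_decimals = np.array([2.557628654, 3.14159, 7.0])

# Round every element to 2 decimal places at once.
rounded = np.round(long_decimals, 2)
print(rounded)
```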

In [None]:
population_in_billions_rounded = np.round(population_in_billions, 2) #SOLUTION
population_in_billions_rounded

In [None]:
_ = ok.grade('q142')

# 2. Making bar charts
You can create arrays of strings, too.  One thing you can do with an array of strings is make a *bar chart* out of it.  There's a bar for each unique string in the array, and the size of each bar shows the number of times that string appears.

We've provided a function called `draw_group_barh` just for this lab.  It takes two arguments.  The first argument is a single string, and it's used as the title of the chart.  The second argument is an array of strings.  The function doesn't return anything, but rather displays a horizontal bar chart of the strings.  (It's similar to `print` in not having a return value, except that it displays a picture rather than printing text.)
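
To see the numbers such a chart would display, you can count how often each string appears. The sketch below does that with NumPy's `np.unique`; it is only an illustration on a made-up array named `colors`, not a description of how `draw_group_barh` works internally.

```python
import numpy as np

colors = np.array(["red", "blue", "red", "green", "red", "blue"])

# np.unique returns the distinct values in sorted order; with
# return_counts=True it also returns how often each one appears.
distinct, counts = np.unique(colors, return_counts=True)
print(distinct)   # blue, green, red
print(counts)     # 2, 1, 3
```

Each distinct string would get one bar, and the matching count would set its length.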

The cell below loads part of the comic book gender dataset from the textbook.  It uses a function we've defined for this lab; normally loading a dataset requires a bit more work.  Run the cell to continue.

In [None]:
# This might take a minute or two.
marvel_url = "https://github.com/fivethirtyeight/data/raw/master/comic-characters/marvel-wikia-data.csv"
genders = load_and_clean_table(marvel_url).column("Gender")
genders

Use the `type` function to figure out what `genders` is, if you're not sure.
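
For comparison, here is what `type` reports for a small array built by hand with `np.array`, and for one of its elements. The name `sample` is made up for the example.

```python
import numpy as np

sample = np.array(["x", "y"])

print(type(sample))           # <class 'numpy.ndarray'>
print(type(sample.item(0)))   # <class 'str'>
```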

**Question 2.1.** Use the `len` function to figure out the size of the dataset.  (Look it up with `len?` if necessary.)  Assign the name `num_marvel_characters` to that size.

In [None]:
num_marvel_characters = len(genders) #SOLUTION
num_marvel_characters

In [None]:
_ = ok.grade('q21')

**Question 2.2.** We can't really read the whole `genders` array - it's much too big.  Instead, create a bar chart displaying the count of each gender.

In [None]:
draw_group_barh("Marvel character genders", genders)

Congratulations, you're done with the lab!

Be sure to **save this notebook** and then run the two cells below to check and submit your work.

In [None]:
# For your convenience, you can run this cell to run all the tests at once.

_ = ok.grade_all()

In [None]:
_ = ok.submit()