# Lab 2: Data Types
Welcome to lab 2!

Last time, we had our first look at Python and Jupyter notebooks.  So far, we've only used Python to manipulate single elements (like ints and strings) at a time, but when doing data science we'll often want to manipulate multiple elements at once.

In this lab, you'll see how to work with datasets in Python -- *collections* of data, like the numbers 2 through 5 or the words "welcome", "to", and "lab".

Initialize the OK tests to get started.

In [2]:
# from client.api.notebook import Notebook
# ok = Notebook('lab02.ok')
# _ = ok.auth(inline=True)

**Deadline**: If you are not attending lab physically, you have to complete this lab and submit by Tuesday, 8/29 11:59pm in order to receive lab credit. Otherwise, please attend the lab you are enrolled in, get the check-off with your (u)GSI or learning assistant **AND** submit this assignment (with whatever progress you've made) to receive lab credit.

**Submission**: Once you're finished, select "Save and Checkpoint" in the File menu and then execute the submit cell below (or at the end). The result will contain a link that you can use to check that your assignment has been submitted successfully. 

In [None]:
_ = ok.submit()

# 1. Arrays

Up to now, we haven't done much that you couldn't do yourself by hand, without going through the trouble of learning Python.  Computers are most useful when you can use a small amount of code to *do the same action* to *many different things*.

For example, in the time it takes you to calculate the 18% tip on a restaurant bill, a laptop can calculate 18% tips for every restaurant bill paid by every human on Earth that day.  (That's if you're pretty fast at doing arithmetic in your head!)

**Arrays** are how we put many values in one place so that we can operate on them as a group. For example, if `billions_of_numbers` is an array of numbers, the expression

    .18 * billions_of_numbers

gives a new array of numbers that's the result of multiplying each number in `billions_of_numbers` by .18 (18%).  Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.

Concretely, an array is a **collection of values of the same type**, like a column in an Excel spreadsheet. 

<img src="excel_array.jpg">

## 1.1. Making arrays
You can type in the data that goes in an array yourself, but that's not typically how programs work. Normally, we create arrays by loading them from an external source, like a data file.

First, though, let's learn how to do it the hard way. Execute the following cell so that all the names from the `datascience` module are available to you.

In [2]:
from datascience import *

Now, to create an array, call the function `make_array`.  Each argument you pass to `make_array` will be in the array it returns.  Run this cell to see an example:

In [36]:
make_array(0.125, 4.75, -1.3)

Each value in an array (in the above case, the numbers 0.125, 4.75, and -1.3) is called an *element* of that array.

Arrays themselves are also values, just like numbers and strings.  That means you can assign them names or use them as arguments to functions.

**Question 1.1.1.** Make an array containing the numbers 1, 2, and 3, in that order.  Name it `small_numbers`.

In [37]:
small_numbers = ...
small_numbers

In [38]:
_ = ok.grade('q111')

**Question 1.1.2.** Make an array containing the numbers 0, 1, -1, $\pi$, and $e$, in that order.  Name it `interesting_numbers`.  *Hint:* How did you get the values $\pi$ and $e$ earlier?  You can refer to them in exactly the same way here.

In [39]:
interesting_numbers = ...
interesting_numbers

In [40]:
_ = ok.grade('q112')

**Question 1.1.3.** Make an array containing the five strings `"Hello"`, `","`, `" "`, `"world"`, and `"!"`.  (The third one is a single space inside quotes.)  Name it `hello_world_components`.

*Note:* If you print `hello_world_components`, you'll notice some extra information in addition to its contents: `dtype='<U5'`.  That's just NumPy's extremely cryptic way of saying that the things in the array are strings.

In [41]:
hello_world_components = ...
hello_world_components

In [1]:
_ = ok.grade('q113')
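To make the `dtype` note above a little less cryptic, here is a minimal sketch on a different array of strings. It uses `np.array` as a stand-in for `make_array` (an assumption made so the example is self-contained), and the words in it are made up for illustration.

```python
import numpy as np

# np.array stands in for make_array here (an assumption for this sketch).
words = np.array(["lab", "two", "arrays"])

# The dtype of an array of strings has kind 'U' (Unicode text), and the
# number after the U is the length of the longest string in the array.
print(words.dtype)

longest = max(len(w) for w in words)
print(longest)
```

In the note above, the 5 in `<U5` plays the same role: the longest strings in that array have 5 characters.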

The `join` method of a string takes an array of strings as its argument and puts all of the elements together into one string. Try it:

In [43]:
'°'.join(make_array('(╯', '□','）╯︵ ┻━┻'))

**Question 1.1.4.** Assign `separator` to a string so that the name `hello` is bound to the string `'Hello, world!'` in the cell below.

In [44]:
separator = ...
hello = separator.join(hello_world_components)
hello

In [45]:
_ = ok.grade('q114')

**Question 1.1.5.** We mentioned above that arrays are collections of values of the same type. But what happens if we try to make an array of different types? When we initialize a NumPy array, the values are implicitly cast to match each other. To see exactly how this works, set `cast_array` to the result of calling `make_array` on a number and a string.

In [15]:
cast_array = ...
cast_array
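Implicit casting also happens between kinds of numbers. The sketch below shows a related case, mixing whole numbers with a decimal number; it uses `np.array` directly (an assumption, standing in for `make_array`).

```python
import numpy as np

whole_numbers = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])   # one float among the ints

print(whole_numbers.dtype)   # an integer type
print(mixed_numbers.dtype)   # a floating-point type
print(mixed_numbers)         # every element is now a float
```

Because one element is a float, all of the elements are stored as floats, so that the array still holds values of a single type.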

## 1.2. `np.arange`
Arrays are provided by a package called [NumPy](http://www.numpy.org/) (pronounced "NUM-pie" or, if you prefer to pronounce things incorrectly, "NUM-pee").  The package is called `numpy`, but it's standard to rename it `np` for brevity.  You can do that with:

    import numpy as np

Very often in data science, we want to work with many numbers that are evenly spaced within some range.  NumPy provides a special function for this called `arange`.  `np.arange(start, stop, space)` produces an array with all the numbers starting at `start` and counting up by `space`, stopping before `stop` is reached.

For example, the value of `np.arange(1, 6, 2)` is an array with elements 1, 3, and 5 -- it starts at 1 and counts up by 2, then stops before 6.  In other words, it's equivalent to `make_array(1, 3, 5)`.

`np.arange(4, 9, 1)` is an array with elements 4, 5, 6, 7, and 8.  (It doesn't contain 9 because `np.arange` stops *before* the stop value is reached.)
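The two examples above, plus one more case where the stop value would otherwise land exactly on a step, can be checked with a short sketch:

```python
import numpy as np

odds = np.arange(1, 6, 2)            # starts at 1, counts up by 2, stops before 6
four_to_eight = np.arange(4, 9, 1)   # 9 itself is left out
fives = np.arange(0, 10, 5)          # 10 is the stop value, so it is left out too

print(odds)
print(four_to_eight)
print(fives)
```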

**Question 1.2.1.** Import `numpy` as `np` and then use `np.arange` to create an array with the multiples of 99 from 0 up to (**and including**) 9999.  (So its elements are 0, 99, 198, 297, etc.)

In [4]:
...
multiples_of_99 = ...
multiples_of_99

In [47]:
_ = ok.grade('q121')

##### Temperature readings
NOAA (the US National Oceanic and Atmospheric Administration) operates weather stations that measure surface temperatures at different sites around the United States.  The hourly readings are [publicly available](http://www.ncdc.noaa.gov/qclcd/QCLCD?prior=N).

Suppose we download all the hourly data from the Oakland, California site for the month of December 2015.  To analyze the data, we want to know when each reading was taken, but we find that the data don't include the timestamps of the readings (the time at which each one was taken).

However, we know the first reading was taken at the first instant of December 2015 (midnight on December 1st) and each subsequent reading was taken exactly 1 hour after the last.

**Question 1.2.2.** Create an array of the *time, in seconds, since the start of the month* at which each hourly reading was taken.  Name it `collection_times`.

*Hint 1:* There were 31 days in December, which is equivalent to ($31 \times 24$) hours or ($31 \times 24 \times 60 \times 60$) seconds.  So your array should have $31 \times 24$ elements in it.

*Hint 2:* The `len` function works on arrays, too.  If your `collection_times` isn't passing the tests, check its length and make sure it has $31 \times 24$ elements.

In [48]:
collection_times = ...
collection_times

In [49]:
_ = ok.grade('q122')

## 1.3. Working with single elements of arrays ("indexing")
Let's work with a more interesting dataset.  The next cell creates an array called `population` that includes estimated world populations in every year from **1950** to roughly the present.  (The estimates come from the [US Census Bureau website](http://www.census.gov/population/international/data/worldpop/table_population.php).)

Rather than type in the data manually, we've loaded them from a file on your computer called `world_population.csv`.  You'll learn how to do that next week.

In [75]:
# Don't worry too much about what goes on in this cell.
from datascience import *
population = Table.read_table("world_population.csv").column("Population")
population

Here's how we get the first element of `population`, which is the world population in the first year in the dataset, 1950.

In [51]:
population.item(0)

The value of that expression is the number 2557628654 (around 2.5 billion), because that's the first thing in the array `population`.

Notice that we wrote `.item(0)`, not `.item(1)`, to get the first element.  This is a weird convention in computer science.  0 is called the *index* of the first item.  It's the number of elements that appear *before* that item.  So 3 is the index of the 4th item.

Here are some more examples.  In the examples, we've given names to the things we get out of `population`.  Read and run each cell.

In [52]:
# The third element in the array is the population
# in 1952.
population_1952 = population.item(2)
population_1952

In [53]:
# The thirteenth element in the array is the population
# in 1962 (which is 1950 + 12).
population_1962 = population.item(12)
population_1962

In [54]:
# The 66th element is the population in 2015.
population_2015 = population.item(65)
population_2015

In [55]:
# The array has only 66 elements, so this doesn't work.
# (There's no element with 66 other elements before it.)
population_2016 = population.item(66)
population_2016

In [None]:
# Since make_array returns an array, we can call .item(3)
# on its output to get its 4th element, just like we
# "chained" together calls to the method "replace" earlier.
make_array(-1, -3, 4, -2).item(3)

**Question 1.3.1.** Set `population_1973` to the world population in 1973, by getting the appropriate element from `population` using `item`.

In [56]:
population_1973 = ...
population_1973

In [57]:
_ = ok.grade('q131')

## 1.4. Doing something to every element of an array
Arrays are primarily useful for doing the same operation many times, so we don't often have to use `.item` and work with single elements.

##### Logarithms
Here is one simple question we might ask about world population:

> How big was the population in *orders of magnitude* in each year?

The logarithm function is one way of measuring how big a number is. The logarithm (base 10) of a number increases by 1 every time we multiply the number by 10. It's like a measure of how many decimal digits the number has, or how big it is in orders of magnitude.
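As a quick illustration of that idea, the sketch below uses only the standard `math` module. It prints the base-10 logarithm of a few powers of 10, then uses the logarithm of the 1950 population figure quoted earlier to count its decimal digits.

```python
import math

# Each multiplication by 10 adds 1 to the base-10 logarithm.
for number in (1, 10, 100, 1000):
    print(number, math.log10(number))

# The 1950 world population from earlier in this lab.
population_1950 = 2557628654
magnitude = math.log10(population_1950)

# Rounding the logarithm down and adding 1 gives the number of digits.
digits = math.floor(magnitude) + 1
print(magnitude, digits)
```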

We could try to answer our question like this, using the `log10` function from the `math` module and the `item` method you just saw:

In [58]:
import math

population_1950_magnitude = math.log10(population.item(0))
population_1951_magnitude = math.log10(population.item(1))
population_1952_magnitude = math.log10(population.item(2))
population_1953_magnitude = math.log10(population.item(3))
...

But this is tedious and doesn't really take advantage of the fact that we are using a computer.

Instead, NumPy provides its own version of `log10` that takes the logarithm of each element of an array.  It takes a single array of numbers as its argument.  It returns an array of the same length, where the first element of the result is the logarithm of the first element of the argument, and so on.

**Question 1.4.1.** Use it to compute the logarithms of the world population in every year.  Give the result (an array of 66 numbers) the name `population_magnitudes`.  Your code should be very short.

In [78]:
population_magnitudes = ...
population_magnitudes

In [60]:
_ = ok.grade('q141')

<img src="array_logarithm.jpg">

This is called *elementwise* application of the function, since it operates separately on each element of the array it's called on.  The textbook's section on arrays has a useful list of NumPy functions that are designed to work elementwise, like `np.log10`.
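Here is a minimal sketch of elementwise application on a small, made-up array (the values are chosen only so the results are easy to check by hand):

```python
import numpy as np

values = np.array([1, 100, 10000])

roots = np.sqrt(values)    # square root of each element
logs = np.log10(values)    # base-10 logarithm of each element

print(roots)
print(logs)
```

In both cases the result has the same length as `values`, and each element of the result depends only on the element at the same position in the input.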

##### Arithmetic
Arithmetic also works elementwise on arrays.  For example, you can divide all the population numbers by 1 billion to get numbers in billions:

In [61]:
population_in_billions = population / 1000000000
population_in_billions

You can do the same with addition, subtraction, multiplication, and exponentiation (`**`). For example, you can calculate a tip on several restaurant bills at once (in this case just 3):

In [62]:
restaurant_bills = make_array(20.12, 39.90, 31.01)
print("Restaurant bills:\t", restaurant_bills)
tips = .2 * restaurant_bills
print("Tips:\t\t\t", tips)

<img src="array_multiplication.jpg">
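
Exponentiation works elementwise in the same way. The sketch below uses made-up side lengths and `np.array` (an assumption, standing in for `make_array`); it also shows why `**` and `^` should not be confused in Python.

```python
import numpy as np

side_lengths = np.array([1, 2, 3])
volumes = side_lengths ** 3    # cube each element
print(volumes)

# Caution: in Python, ^ is bitwise XOR, not exponentiation.
print(2 ** 3)   # 8
print(2 ^ 3)    # 1
```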

**Question 1.4.2.** Suppose the total charge at a restaurant is the original bill plus the tip.  That means we can multiply the original bill by 1.2 to get the total charge.  Compute the total charge for each bill in `restaurant_bills`.

In [63]:
total_charges = ...
total_charges

In [64]:
_ = ok.grade('q142')

**Question 1.4.3.** `more_restaurant_bills.csv` contains 100,000 bills!  Compute the total charge for each one.  How is your code different?

In [65]:
more_restaurant_bills = Table.read_table("more_restaurant_bills.csv").column("Bill")
more_total_charges = ...
more_total_charges

In [66]:
_ = ok.grade('q143')

The function `sum` takes a single array of numbers as its argument.  It returns the sum of all the numbers in that array (so it returns a single number, not an array).
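For instance, on a small array of made-up step counts (hypothetical data, built with `np.array` as a stand-in for `make_array`):

```python
import numpy as np

daily_steps = np.array([4000, 7500, 6200])

# sum collapses the whole array into a single number.
total_steps = sum(daily_steps)
print(total_steps)
```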

**Question 1.4.4.** What was the sum of all the bills in `more_restaurant_bills`, *including tips*?

In [67]:
sum_of_bills = ...
sum_of_bills

In [68]:
_ = ok.grade('q144')

**Question 1.4.5.** The powers of 2 ($2^0 = 1$, $2^1 = 2$, $2^2 = 4$, etc) arise frequently in computer science.  (For example, you may have noticed that storage on smartphones or USB drives comes in powers of 2, like 16 GB, 32 GB, or 64 GB.)  Use `np.arange` and the exponentiation operator `**` to compute the first 30 powers of 2, starting from $2^0$.

In [97]:
powers_of_2 = ...
powers_of_2

In [2]:
_ = ok.grade('q145')

## 1.5. Doing something to the elements in two different arrays

Often, we'll have two arrays containing pairs of values which we may want to use together. For instance, we might have an array of prices of different products, and an array of percentage discounts to apply to each product. In this case, we can multiply the two arrays together to get the total discounts on each product:

In [110]:
prices = make_array(5, 8, 12)
discounts = make_array(.2, .4, 0)

discounts_in_dollars = prices * discounts
discounts_in_dollars

<img src="array_array.png">

### Question 1.5.1

Using the `prices` and `discounts_in_dollars` arrays defined above, set `discounted_prices` to the prices of each product after the discount.

In [113]:
discounted_prices = ...

### Question 1.5.2

As we've already seen, arrays are great for generating mathematical sequences. Set `multiples` to the sequence

$$ 1\times2,~ 2\times3,~ 3\times4,~ \ldots,~ 100\times101 $$

You can use multiple lines if you want.

In [114]:
multiples = ...
multiples

One common error we'll see when using arrays is a length mismatch error, which happens when we try to operate on two arrays of different lengths. NumPy will often say that two arrays could not be `broadcast` together, which is just another way of saying their lengths are incompatible.
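The sketch below triggers that error on purpose with two small, made-up arrays and catches it, so you can see the wording of the message without stopping the notebook.

```python
import numpy as np

three_values = np.array([1, 2, 3])
two_values = np.array([10, 20])

try:
    three_values + two_values   # lengths 3 and 2 do not match
    message = None
except ValueError as error:
    message = str(error)

print(message)
```

The message mentions the shapes of both arrays, which is usually the quickest way to spot which one has the wrong length.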

### Question 1.5.3
Take a look at the cell below and figure out what's wrong with it. Then, change one of the two arrays so `error_array` is set to $2,4,6,8$.

In [11]:
error_array = make_array(1,2,3,4) + make_array(1,2,3)
error_array

# 2. Plotting information

Arrays are also very useful for making graphs. Two common uses are plotting data to understand it better and visualizing mathematical functions. Below, we'll walk through how to do both. We'll use `pyplot`'s `plot` function; import it below.

In [17]:
from matplotlib.pyplot import plot
# This line is a special Jupyter command which tells the notebook how to print out graphs:
%matplotlib inline

## 2.1 Plotting Data

The `plot` function takes two arguments:
* An array of $x$ coordinates
* An array of $y$ coordinates

For example, below we plot population against time, using the `population` array from section 1.3.

In [86]:
years = np.arange(1950, 2016)
plot(years, population)

### Question 2.1.1
Below, plot the magnitude of the population (from Q1.4.1) against the year. You'll need to set `test` equal to the return value of the plot function, but this is only for testing purposes.

In [103]:
years = np.arange(1950, 2016)
# Replace the ... with a call to plot. The 'test=' is for testing purposes.
test = ...

### Question 2.1.2
At first glance, it looks like populations have increased fairly regularly over the last 60 years. When dealing with "time series" like this, it's often helpful to not just look at values, but also changes in values over time. Below we create two new arrays, one with the populations between 1950 and 2014, and one with populations between 1951 and 2015 (to do this we use array slicing, which you won't need to know in this class).
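As a minimal sketch of what those two slices do, here they are applied to a small, made-up array of letters (built with `np.array`, an assumption standing in for `make_array`):

```python
import numpy as np

letters = np.array(["a", "b", "c", "d"])

all_but_last = letters[:-1]    # drops the final element
all_but_first = letters[1:]    # drops the first element

print(all_but_last)
print(all_but_first)
```

The two slices have the same length, and the elements at the same position in each are neighbors in the original array.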

Set `changes` to an array of how much the population increased each year between 1950 and 2014. 

In [108]:
change_years = np.arange(1950, 2015)

pops_1950_2014 = population[:-1]
pops_1951_2015 = population[1:]

changes = ...


plot(change_years, changes)

## 2.2. Plotting functions

Another use of `plot` is to plot mathematical functions. This can be very useful if you have the formula for a function but are having trouble visualizing it.

### Question 2.2.1
We'll start by plotting the function $y = x^2$ for a few values of $x$ between -5 and 5. Set `y` appropriately.

In [26]:
x = np.arange(-5, 5, 1)
y = ...
plot(x, y)

### Question 2.2.2

This looks okay, but the function $y=x^2$ is smooth, while the graph above looks a little jagged around the $x$ values we selected. One way to fix this is to increase the "resolution" by using more $x$ values in the same range. 

Below, set `finer_x` to be the range of all values between -5 and 5, taking steps of size .1. Then set `finer_y` appropriately to graph $y=x^2$ again. (In your graph you may see a small gap at the top right. That's okay; it's just our graphing package making a poor choice of axis lengths.)

In [30]:
finer_x = ...
finer_y = ...
plot(finer_x, finer_y)

### Question 2.2.3

Now we've got a pretty nice looking graph! But we picked a boring function to start. Let's try a function that's harder to visualize. Below, set `crazy_y` so we plot
$$y=x\sin(x) + x$$
between $-100$ and $100$.

Hint: You can use `np.sin`.

In [73]:
crazy_x = np.arange(-100, 100, .1)

crazy_y = ...

plot(crazy_x, crazy_y)

# 3. Introduction to Tables

For a collection of things in the world, an array is useful for describing a single attribute of each thing. For example, among the collection of US States, an array could describe the land area of each. Tables extend this idea by describing multiple attributes for each element of a collection.

In most data science applications, we have data about many entities, but we also have several kinds of data about each entity.

For example, in the cell below we print two arrays. The first one contains the world population in each year, from above, and the second contains the years themselves (in order, so the first elements in the population and the years arrays correspond).

In [5]:
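# Load the population array again under the name used in the table cell below.
population_amounts = Table.read_table("world_population.csv").column("Population")
years = np.arange(1950, 2015+1)
print("Population column:", population_amounts)
print("Years column:", years)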

Suppose we want to answer this question:

> When did world population cross 6 billion?

You could technically answer this question just from staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 6 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a *`Table`*, a 2-dimensional type of dataset. 

The expression below:

- creates an empty table using the expression `Table()`,
- adds two columns by calling `with_columns` with four arguments,
- assigns the result to the name `population`, and finally
- evaluates `population` so that we can see the table.

The strings `"Year"` and `"Population"` are column labels that we have chosen. The names `population_amounts` and `years` were assigned above to two arrays of the same length. The method `with_columns` (you can find the documentation [here](http://data8.org/datascience/tables.html)) takes in alternating strings (to represent column labels) and arrays (representing the data in those columns), which are all separated by commas.

In [6]:
population = Table().with_columns(
    "Population", population_amounts,
    "Year", years
)
population

Now the data are all together in a single table! It's much easier to parse this data--if you need to know what the population was in 1959, for example, you can tell from a single glance. We'll revisit this table later.

## 3.1 Creating Tables
### Question 3.1.1.
In the cell below, we've created 2 arrays. Using the steps above, assign `top_10_movies` to a table that has two columns called "Rating" and "Name", which hold `top_10_movie_ratings` and `top_10_movie_names` respectively.

In [7]:
top_10_movie_ratings = make_array(9.2, 9.2, 9., 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8)
top_10_movie_names = make_array(
        'The Shawshank Redemption (1994)',
        'The Godfather (1972)',
        'The Godfather: Part II (1974)',
        'Pulp Fiction (1994)',
        "Schindler's List (1993)",
        'The Lord of the Rings: The Return of the King (2003)',
        '12 Angry Men (1957)',
        'The Dark Knight (2008)',
        'Il buono, il brutto, il cattivo (1966)',
        'The Lord of the Rings: The Fellowship of the Ring (2001)')

top_10_movies = ...
# We've put this next line here so your table will get printed out when you
# run this cell.
top_10_movies

In [2]:
_ = ok.grade('q311')

#### Loading a table from a file
In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we can load a table from a data file using `Table.read_table`.

`Table.read_table` takes one argument, a path to a data file (a string) and returns a table.  There are many formats for data files, but CSV ("comma-separated values") is the most common.

### Question 3.1.2. 
The file `imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb.  Load it as a table called `imdb`.

In [9]:
imdb = ...
imdb

In [2]:
_ = ok.grade('q312')

Notice the part about "... (240 rows omitted)."  This table is big enough that only a few of its rows are displayed, but the others are still there.  10 are shown, so there are 250 movies total.

Where did `imdb.csv` come from? Take a look at [this lab's folder](./). You should see a file called `imdb.csv`.

Open up the `imdb.csv` file in that folder and look at the format. What do you notice? The `.csv` filename ending says that this file is in the [CSV (comma-separated value) format](http://edoceo.com/utilitas/csv-file-format).
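To give a feel for the format, here is a minimal sketch that parses a tiny piece of CSV text with Python's standard `csv` module. The two-column layout is hypothetical and is not the actual contents of `imdb.csv`; the movie names and ratings are borrowed from the top-10 arrays above.

```python
import csv
import io

# Each line is one row; commas separate the values in a row.
# A value that itself contains commas is wrapped in double quotes.
csv_text = (
    "Rating,Name\n"
    "9.2,The Godfather (1972)\n"
    '8.9,"Il buono, il brutto, il cattivo (1966)"\n'
)

rows = list(csv.reader(io.StringIO(csv_text)))
header, first_movie, second_movie = rows

print(header)
print(first_movie)
print(second_movie)
```

Note that the `csv` module reads every value as a string, so the ratings come back as text like `'9.2'` rather than as numbers.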

Next week, we'll talk about how to use these tables to extract useful information from our data.

Congratulations, you're done with lab 2!  Be sure to 
- **run all the tests** (the next cell has a shortcut for that), 
- **Save and Checkpoint** from the `File` menu,
- **run the last cell to submit your work**,
- and ask one of the staff members to check you off.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

In [None]:
_ = ok.submit()