In [None]:
# Initialize Otter
import otter
grader = otter.Notebook()

# Problem Set 0 - Python & Jupyter Notebooks

In this problem set, you will learn how to:

* navigate Jupyter notebooks (like this one).
* write and evaluate some basic *expressions* in Python, the computer language of the course.
* call *functions* to use code other people have written, and use *loops* to repeat code.
* manipulate data using the `pandas` package.

For reference, you might find it useful to read [Chapter 3 of the Data 8 textbook](http://www.inferentialthinking.com/chapters/03/programming-in-python.html). [Chapter 1](https://www.inferentialthinking.com/chapters/01/what-is-data-science.html) and [Chapter 2](https://www.inferentialthinking.com/chapters/02/causality-and-experiments.html) are worth skimming as well.

## 1. Jupyter Notebooks

This webpage is called a Jupyter notebook. A notebook is an interactive computing environment to write programs and view their results, and also to write text.

A notebook is thus an editable computer document in which you can write computer programs; view their results; and comment, annotate, and explain what is going on. 
[Project Jupyter](https://en.wikipedia.org/wiki/Project_Jupyter) is headquartered here at Berkeley, where its original creator [Fernando Pérez](https://en.wikipedia.org/wiki/Fernando_Pérez_(software_developer)) works; its purpose is to build human-friendly frameworks for interactive computing. 
If you want to see what Fernando looks and sounds like, you can load and watch a 15-minute inspirational video by clicking on "YouTubeVideo" below and then on the `▶` in the toolbar above: 

In [6]:
from IPython.display import YouTubeVideo

YouTubeVideo("Wd6a3JIFH0s")

### 1.1. Text cells

In a notebook, each rectangle containing text or code is called a *cell*.

Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings. 
You don't need to learn markdown, but you might want to.

After you edit a text cell, click the "run cell" button that looks like `▶` in the toolbar at the top of this window, or hold down `shift` and press `return`, to confirm any changes to the text and formatting. 

**Question 1.1.1.** This paragraph is in its own text cell.  Try editing it so that this sentence is the last sentence in the paragraph, and then click the "run cell" ▶| button or hold down `shift` + `return`.  This sentence, for example, should be deleted.  So should this one.

### 1.2. Code cells

Other cells contain code in the Python 3 language. Running a code cell will execute all of the code it contains.

To run the code in a code cell, first click on that cell to activate it.  It'll be highlighted with a little green or blue rectangle.  Next, either press `▶` or hold down `shift` + press `return`.

Try running this cell:

In [7]:
print("Hello, world!")

And this one:

In [8]:
print("\N{WAVING HAND SIGN}, \N{EARTH GLOBE ASIA-AUSTRALIA}!")

The fundamental building block of Python code is an expression. Cells can contain multiple lines with multiple expressions. When you run a cell, the lines of code are executed in the order in which they appear. Every `print` expression prints a line. Run the next cell and notice the order of the output.

In [9]:
print("First this line is printed,")
print("and then this one.")

### 1.3. Writing notebooks

You can use Jupyter notebooks for your own projects or documents.  When you make your own notebook, you'll need to create your own cells for text and code.

To add a cell, click the + button in the menu bar.  It'll start out as a text cell.  You can change it to a code cell by clicking inside it so it's highlighted, clicking the drop-down box next to the restart (⟳) button in the menu bar, and choosing "Code".

**Question 1.3.1.** Add a code cell below this one.  Write code in it that prints out:
   
    Econometrics is what econometricians do.

Run your cell to verify that it works.

### 1.4. "Errors"

Python is a language, and like natural human languages, it has rules.  It differs from natural language in two important ways:

1. The rules are *simple*. You will get the hang of them in a few weeks.
2. The rules are *rigid*. If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes.  A computer running Python code is not smart enough to do that.

Whenever you write code, you'll make mistakes.  When you run a code cell that has errors, Python will sometimes produce error messages to tell you what you did wrong.

Errors are okay; even experienced programmers make many errors.  When you make an error, you just have to find the source of the problem, fix it, and move on.

We have made an error in the next code cell.  Run it and see what happens.

(**Note:** In the toolbar, there is the option to click `Cell > Run All`, which will run all the code cells in this notebook in order. However, the notebook stops running code cells if it hits an error, like the one in the cell just below.)

In [10]:
print("This line is missing something."

You should see something like this (minus our annotations):

<img src="error.jpg" width="1000" />

The last line of the error output attempts to tell you what went wrong.  The *syntax* of a language is its structure, and this `SyntaxError` tells you that you have created an illegal structure.  "`EOF`" means "end of file," so the message is saying Python expected you to write something more (in this case, a right parenthesis) before finishing the cell.  Newer versions of Python may word this message differently (for example, `'(' was never closed`).

There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. 
If you see a cryptic message like this, you can often get by without deciphering it.  (Of course, if you're frustrated, feel free to ask a course staff member for help.)

Try to fix the code above so that you can run the cell and see the intended message instead of an error.

In [11]:
print("This line is missing something.")

### 1.5. The kernel

The kernel is a program that executes the code inside your notebook and outputs the results. In the top right of your window, you can see a circle that indicates the status of your kernel. If the circle is empty (⚪), the kernel is idle and ready to execute code. If the circle is filled in (⚫), the kernel is busy running some code. 

Next to every code cell, you'll see some text that says `In [...]`. Before you run the cell, you'll see `In [ ]`. When the cell is running, you'll see `In [*]`. If you see an asterisk (\*) next to a cell that doesn't go away, it's likely that the code inside the cell is taking too long to run, and it might be a good time to interrupt the kernel (discussed below). When a cell is finished running, you'll see a number inside the brackets, like so: `In [1]`. The number corresponds to the order in which you run the cells; so, the first cell you run will show a 1 when it's finished running, the second will show a 2, and so on. 

You may run into problems where your kernel is stuck for an excessive amount of time, your notebook is very slow and unresponsive, or your kernel loses its connection. If this happens, try the following steps:
1. At the top of your screen, click **Kernel**, then **Interrupt**.
2. If that doesn't help, click **Kernel**, then **Restart**. If you do this, you will have to run your code cells from the start of your notebook up until where you paused your work.
3. If that doesn't help, restart your server. First, save your work by clicking **File** at the top left of your screen, then **Save and Checkpoint**. Next, click **Control Panel** at the top right. Choose **Stop My Server** to shut it down, then **Start My Server** to start it back up. Then, navigate back to the notebook you were working on. You'll still have to run your code cells again.

### 1.6. Libraries

Many add-ons and extensions to the core of Python are useful for getting work done. They are contained in what are called libraries. The rest of this notebook needs three of them, so let us tell Python to load them. Run the code cell below to do so.

The following three lines import `numpy`, `pandas`, and `matplotlib.pyplot`, respectively. 
Note that we import `numpy` as `np`, `pandas` as `pd`, and `matplotlib.pyplot` as `plt`. 
This means that, for example, when we call functions in `pandas`, we will always reference them with `pd` first. 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 2. Python: Numbers & Variables

Quantitative information arises everywhere in economics. In addition to representing commands to print out lines, expressions can represent numbers and methods of combining numbers. The expression `3.2500` evaluates to the number 3.25. (Run the cell and see.)

In [13]:
3.2500

Notice that we didn't have to `print`. When you run a notebook cell, if the last line has a value, then Jupyter helpfully prints out that value for you. However, it won't print out prior lines automatically.

In [14]:
print(2)
3
4

Above, you should see that 4 is the value of the last expression, 2 is printed, but 3 is lost because it was neither printed nor last.

You don't want to print everything all the time anyway. But if you feel sorry for 3, change the cell above to print it.

### 2.1. Arithmetic

The line in the next cell subtracts.  Its value is what you'd expect.  Run it.

In [15]:
3.25 - 1.5

Many basic arithmetic operations are built into Python.  The Data 8 textbook section on [Expressions](http://www.inferentialthinking.com/chapters/03/1/expressions.html) describes all the arithmetic operators used in the course. 
The common operator that differs from typical math notation is `**`, which raises one number to the power of the other. So, `2**3` stands for $2^3$ and evaluates to 8. 

The order of operations is the same as what you learned in elementary school, and Python also has parentheses.  For example, compare the outputs of the cells below. The second cell uses parentheses for a happy new year!

In [16]:
5+6*5-6*3**2*2**3/4*7

In [17]:
5+(6*5-(6*3))**2*((2**3)/4*7)

In standard math notation, the first expression is

$$5 + 6 \times 5 - 6 \times 3^2 \times \frac{2^3}{4} \times 7,$$

while the second expression is

$$5 + (6 \times 5 - (6 \times 3))^2 \times (\frac{(2^3)}{4} \times 7).$$
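
To see where the two results come from, here is a small sketch that rebuilds each expression one piece at a time. The intermediate names (`first_product`, `inner_difference`, `right_factor`) are ours, introduced only for illustration.

```python
# First expression: 5 + 6*5 - 6 * 3**2 * 2**3 / 4 * 7
# Exponents are evaluated first, then * and / from left to right, then + and -.
first_product = 6 * 3**2 * 2**3 / 4 * 7   # 6 * 9 * 8 / 4 * 7
first_value = 5 + 6*5 - first_product

# Second expression: the parentheses change which pieces are grouped together.
inner_difference = 6*5 - (6*3)            # 30 - 18
right_factor = (2**3) / 4 * 7             # 8 / 4 * 7
second_value = 5 + inner_difference**2 * right_factor

print(first_value)
print(second_value)
```

Naming the pieces does not change the results; it only makes the order of evaluation visible.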

**Question 2.1.1.** Write a Python expression in this next cell that's equal to $5 \times (3 \frac{10}{11}) - 50 \frac{1}{3} + 2^{.5 \times 22} - \frac{7}{33} + 4$.  
That's five times three and ten elevenths, minus fifty and a third, plus two to the power of half twenty-two, minus seven thirty-thirds plus four.  
By "$3 \frac{10}{11}$" we mean $3+\frac{10}{11}$, not $3 \times \frac{10}{11}$.

Replace the ellipses (`...`) with your expression.  Try to use parentheses only when necessary.

*Hint:* The correct output should start with a familiar number.

In [18]:
...

### 2.2. Variables

In natural language, we have terminology that lets us quickly reference very complicated concepts.  We don't say, "That's a large mammal with brown fur and sharp teeth!"  Instead, we just say, "Bear!"

In Python, we do this with *assignment statements*. An assignment statement has a name on the left side of an `=` sign and an expression to be evaluated on the right.

In [19]:
ten = 3 * 2 + 4

When you run that cell, Python first computes the value of the expression on the right-hand side, `3 * 2 + 4`, which is the number 10.  Then it assigns that value to the name `ten`.  At that point, the code in the cell is done running.

After you run that cell, the value 10 is bound to the name `ten`:

In [20]:
ten

The statement `ten = 3 * 2 + 4` is not asserting that `ten` is already equal to `3 * 2 + 4`, as we might expect by analogy with math notation.  Rather, that line of code changes what `ten` means; it now refers to the value 10, whereas before it meant nothing at all.
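
As a rough illustration of "assignment changes what a name means," the sketch below rebinds a name to a new value. The names `counter` and `doubled` are made up for this example.

```python
# The right-hand side is evaluated first, then the result is bound to the name.
counter = 10
counter = counter + 1   # reads the old value (10), then rebinds counter to 11

# A name computed from counter keeps its own value when counter is rebound later.
doubled = 2 * counter   # computed from the current value of counter
counter = 0

print(counter, doubled)
```

In math notation, `counter = counter + 1` would be a contradiction; in Python it is an instruction to compute a new value and attach the name to it.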

**Question 2.2.1.** Try writing code that uses a name (like `eleven`) that hasn't been assigned to anything.  You'll see an error!

In [21]:
...

A common pattern in Jupyter notebooks is to assign a value to a name and then immediately evaluate the name in the last line in the cell so that the value is displayed as output. 

In [22]:
close_to_pi = 355/113
close_to_pi

Another common pattern is that a series of lines in a single cell will build up a complex computation in stages, naming the intermediate results.

In [23]:
semimonthly_salary = 841.25
monthly_salary = 2 * semimonthly_salary
number_of_months_in_a_year = 12
yearly_salary = number_of_months_in_a_year * monthly_salary
yearly_salary

Names in Python can have letters (upper- and lower-case letters are both okay and count as different letters), underscores, and digits.  The first character can't be a digit (otherwise a name might look like a number).  And names can't contain spaces, since spaces are used to separate pieces of code from each other.
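
One way to check these rules, sketched here with made-up candidate names, is the built-in string method `isidentifier`, which reports whether a piece of text is spelled like a legal name.

```python
# Each line asks whether the text inside the quotes follows the naming rules.
print("yearly_salary".isidentifier())   # letters and underscores
print("salary2".isidentifier())         # a digit after the first character
print("2salary".isidentifier())         # starts with a digit
print("yearly salary".isidentifier())   # contains a space

# Upper- and lower-case letters count as different, so these are two separate names.
rate = 0.05
Rate = 0.07
print(rate, Rate)
```

Note that `isidentifier` only checks spelling. Reserved words such as `for` pass this check but still can't be used as names.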

Other than those rules, what you name something doesn't matter *to Python*.  For example, this cell does the same thing as the above cell, except everything has a different name:

In [24]:
a = 841.25
b = 2 * a
c = 12
d = c * b
d

**However**, names are very important for making your code *readable* to yourself and others.  The cell above is shorter, but it's totally useless without an explanation of what it does.

### 2.3. Checking your code

Our notebooks include built-in *tests* to check whether your work is correct. Sometimes, there are multiple tests for a single question, and passing all of them is required to receive credit for the question. Please don't change the contents of the test cells.

Run the next code cell to initialize the tests:

In [2]:
# Run this code cell to initialize the autograder
import otter
grader = otter.Notebook()

Go ahead and attempt Question 2.3.1. Running the cell directly after it will test whether you have assigned `seconds_in_a_decade` correctly. 
If you haven't, this test will tell you the correct answer. Resist the urge to just copy it, and instead try to adjust your expression.

**Question 2.3.1.** Assign the name `seconds_in_a_decade` to the number of seconds between midnight January 1, 2010 and midnight January 1, 2020. Note that there are two leap years in this span of a decade. A non-leap year has 365 days and a leap year has 366 days.

<!--
BEGIN QUESTION
name: q2_3_1
-->

In [67]:
# Change the next line so that it computes the number of seconds in a decade 
# and assigns that number the name, seconds_in_a_decade.

seconds_in_a_decade = ...

# We've put this line in this cell so that it will print the value you've given 
# to seconds_in_a_decade when you run it. You don't need to change this.

seconds_in_a_decade

In [None]:
grader.check("q2_3_1")

If the autograder found that you had set the right variable(s) to the proper value(s) that it expected, well and good: you are probably not far off track. If you failed any of the tests, go back and think again, and if you are still stuck, ask for help.

### 2.4. Sequences

Lists and their siblings, `numpy` arrays, are ordered collections of objects. 
Lists allow us to store groups of variables under one name. 
Because the collection is ordered, we can access each object by its position for easy analysis. 
If you want an in-depth look at the capabilities of lists, take a look [here](https://www.tutorialspoint.com/python/python_lists.htm).

To initialize a list, you use brackets. Putting objects separated by commas in between the brackets will add them to the list. For example, we can create and name an empty list:

In [69]:
list_example = []
list_example

Or we can create a list with a couple of elements:

In [70]:
list_example_two = [3, 5, 100, "hello"]
list_example_two

We can use the `+` sign between two lists to "concatenate" them together:
    

In [71]:
list_example_three = list_example_two + [1, 3, 6, 'lists', 'are', 'fun', 4]
list_example_three

To access an individual value in the list, simply count from the start of the list, and put the place of the object you want to access in brackets after the name of the list. 
Importantly, you have to start counting from not 1 but 0. 
Thus the initial object of a list has index 0, the second object of a list has index 1, and in the list above the fourth object has index 3:

In [72]:
fourth_element = list_example_three[3]
fourth_element

Lists do not have to be made up of elements of the same kind. 
Indices do not have to be taken one at a time, either. 
Instead, we can take a slice of indices and return the elements at those indices as a separate list. 

Suppose we just want to select the items at indices 4 through 6 of a list. We can do so by 'slicing' from index 4 to 7. 
Note that in Python indexing, the element at the left index number is included while the element at the right number is excluded.

In [73]:
selected_list = list_example_three[4:7]
selected_list
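
The "left included, right excluded" rule has a couple of handy consequences, sketched below on a made-up list called `letters`. Leaving out a slice endpoint (as in `letters[:4]`) means "from the start" or "to the end."

```python
letters = ["a", "b", "c", "d", "e", "f", "g", "h"]

middle = letters[4:7]        # indices 4, 5, and 6; index 7 is excluded
print(middle)

# Because the right end is excluded, the slice has 7 - 4 elements.
print(len(middle))

# Two slices that share an endpoint split the list with no overlap and no gap.
print(letters[:4] + letters[4:] == letters)
```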

We can select the largest and smallest items of a list via `max` and `min`.

In [74]:
# A list containing six integers.
a_list = [1, 6, 4, 8, 13, 2]

# Another list containing six integers.
b_list = [4, 5, 2, 14, 9, 11]

In [75]:
max(a_list)

In [76]:
min(b_list)

If we want to find out the length of a list, we can simply use the `len` function.

In [77]:
len(a_list)

`numpy` arrays are siblings of lists that support arithmetic much more flexibly than regular lists. 
We can make a `numpy` array by passing a Python list to the function `np.array(...)`.
Let us start by making an array that consists of the numbers from zero to nine.

In [78]:
example_array = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
example_array

This could have been accomplished more quickly using `np.arange(...)`, which automatically creates 'ranges'. 
If only one argument (say `X`) is passed into `np.arange`, it will create an array of length `X` from 0 to `X-1`.

In [79]:
example_array_2 = np.arange(10)
example_array_2
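
`np.arange` also accepts a start value and a step size. The short sketch below shows the one-, two-, and three-argument forms side by side; the specific numbers are arbitrary.

```python
import numpy as np

print(np.arange(5))          # one argument: 0 up to, but not including, 5
print(np.arange(2, 7))       # two arguments: start at 2, stop before 7
print(np.arange(0, 20, 5))   # three arguments: count from 0 in steps of 5, stop before 20
```

As with slicing, the stop value itself is never included.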

Notably, we can conduct array arithmetic using `numpy` arrays, which 'broadcasts' arithmetic if one of the terms is a scalar.


In [80]:
every_number_plus_3 = example_array + 3
every_number_plus_3

In [81]:
every_number_to_the_power_of_2 = example_array ** 2
every_number_to_the_power_of_2

We can also conduct arithmetic operations between two arrays of the same length.

In [82]:
example_array_doubled = example_array + example_array
example_array_doubled

Be careful about differences between lists and `numpy` arrays. 
For example, multiplying a list by a number and multiplying an array by a number produce different results.

In [83]:
print('Multiplying a list by 2: ', 2 * b_list)
print('Multiplying an array by 2: ',2 * example_array_2)
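
The same difference shows up with `+`. Here is a minimal side-by-side sketch using a small made-up list and array:

```python
import numpy as np

small_list = [1, 2, 3]
small_array = np.array([1, 2, 3])

print(small_list + small_list)     # concatenation: a longer list
print(small_array + small_array)   # element-by-element addition: same length

print(2 * small_list)              # repetition: the list twice in a row
print(2 * small_array)             # every element doubled
```

For lists, `+` and `*` build longer lists; for arrays, they do arithmetic on the elements.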

**Question 2.4.1.** Assign the name `weird_list` to a list containing the following numbers in the order given:

1. -2
2. 3 to the power of 4
3. 0
4. 7 divided by 2

<!--
BEGIN QUESTION
name: q2_4_1
-->

In [84]:
weird_list = ...

In [None]:
grader.check("q2_4_1")

**Question 2.4.2.** Assign the name `weird_array` to an array containing the following numbers in the order given:

1. -2
2. 3 to the power of 4
3. 0
4. 7 divided by 2

<!--
BEGIN QUESTION
name: q2_4_2
-->

In [86]:
weird_array = ...

In [None]:
grader.check("q2_4_2")

## 3. Programming

### 3.1. Looping

[Loops](https://www.tutorialspoint.com/python/python_loops.htm) are useful for manipulating, iterating over, or transforming large lists and arrays. 
The __for__ loop iterates through a sequence such as a list or an array, performing an action at each element. 
The following code cell prints each element in the array `example_array`. 

In [88]:
for element in example_array:
    print("The current element is", element)

Here's another example: the following code cell moves through every element in `example_array`, adds 5 to it, and appends the sum to a new list.

In [89]:
new_list = []

for element in example_array:
    new_element = element + 5
    new_list.append(new_element) # <list>.append(X) adds the element X at the end of <list>

new_list

The most important line in the above cell is the `for element in ...` line. 
This statement sets the structure of our  loop, instructing the machine to stop at every element in `example_array`, perform the indicated operations, and then move on. 
Once Python has stopped at every element in `example_array`, the loop is completed and the final line, which displays `new_list`, is executed. 

Note that we did not have to use the name `element` to refer to whichever value the loop is currently operating on. We could have called it almost anything. For example:

In [90]:
newer_list = []

for completely_arbitrary_name in example_array:
    newer_element = completely_arbitrary_name + 5
    newer_list.append(newer_element) 

newer_list
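
For this particular task, array arithmetic can do the same job without a loop. The sketch below, which uses its own array called `values` so that it stands alone, builds the result both ways and checks that they agree.

```python
import numpy as np

values = np.arange(10)

# Loop version: visit each element, add 5, and collect the results in a list.
shifted_list = []
for value in values:
    shifted_list.append(value + 5)

# Array version: broadcasting adds 5 to every element in one step.
shifted_array = values + 5

print(shifted_list == list(shifted_array))
```

Loops are still worth knowing, since not every computation can be written as array arithmetic.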

Instead of iterating through elements of a sequence directly, for loops can also iterate over ranges of numerical values (thus iterating over the indexes of a list). 
For example:

In [91]:
for i in np.arange(len(example_array)):
    print("The current element + 5 is", example_array[i] + 5)

**Question 3.1.1.** Iterate through every number in the array `every_number_plus_3`, and print each number minus 3. You should see the numbers from 0 to 9.

<!--
BEGIN QUESTION
name: q3_1_1
-->

In [92]:
for ... in ...:
    new_element = ...
    print(new_element)

### 3.2. Functions

The most common way to combine or manipulate values in Python is by calling functions. Python comes with many built-in functions that perform common operations.

For example, the `abs` function takes a single number as its argument and returns the absolute value of that number. Run the next two cells and see if you understand the output.

In [93]:
abs(5)

In [94]:
abs(-5)

**Example: computing walking distances**. Alan is on the corner of 7th Avenue and 42nd Street in Midtown Manhattan, and he wants to know how far he'd have to walk to get to Gramercy School on the corner of 10th Avenue and 34th Street.

He can't cut across blocks diagonally, since there are buildings in the way.  He has to walk along the sidewalks.  Using the map below, he sees he'd have to walk 3 avenues (long blocks) and 8 streets (short blocks).  In terms of the given numbers, he computed 3 as the difference between 7 and 10, *in absolute value*, and 8 similarly.  

Alan also knows that blocks in Manhattan are all about 80m by 274m (avenues are farther apart than streets).  
So in total, he'd have to walk $(80 \times |42 - 34| + 274 \times |7 - 10|)$ meters to get to the school.

<img src="map.jpg" width="1000" />

**Question 3.2.1.** Fill in the line `num_avenues_away = ...` in the next cell so that the cell calculates the distance Alan must walk and gives it the name `manhattan_distance`.  
Everything else has been filled in for you.  **Use the `abs` function.** Also, be sure to run the test cell afterward to test your code.

<!--
BEGIN QUESTION
name: q3_2_1
-->

In [95]:
# Here's the number of streets away:
num_streets_away = abs(42-34)

# Compute the number of avenues away in a similar way:
num_avenues_away = ...

street_length_m = 80
avenue_length_m = 274

# Now we compute the total distance Alan must walk
manhattan_distance = street_length_m * num_streets_away + avenue_length_m * num_avenues_away

# We've included this line so that you see the distance you've computed when you run this cell. 
# You don't need to change it, but you can if you want.
manhattan_distance

In [None]:
grader.check("q3_2_1")

**Multiple arguments**: Some functions take multiple arguments, separated by commas. For example, the built-in `max` function returns the maximum argument passed to it.

In [97]:
max(2, -3, 4, -5)

### 3.3. Nested expressions
Function calls and arithmetic expressions can themselves contain expressions.  You saw an example in the last question:

    abs(42-34)

has 2 number expressions in a subtraction expression in a function call expression.  And you probably wrote something like `abs(7-10)` to compute `num_avenues_away`.

Nested expressions can turn into complicated-looking code. However, the way in which complicated expressions break down is very regular.

Suppose we are interested in heights that are very unusual.  We'll say that a height is unusual to the extent that it's far away on the number line from the average human height.  [An estimate](http://press.endocrine.org/doi/full/10.1210/jcem.86.9.7875?ck=nck&) of the average adult human height (averaging, we hope, over all humans on Earth today) is 1.688 meters.

So if Kayla is 1.21 meters tall, then her height is $|1.21 - 1.688|$, or $.478$, meters away from the average.  Here's a picture of that:

<img src="numberline_0.png" width="1000" />

And here's how we'd write that in one line of Python code:

In [98]:
abs(1.21 - 1.688)

What's going on here?  `abs` takes just one argument, so the stuff inside the parentheses is all part of that *single argument*.  Specifically, the argument is the value of the expression `1.21 - 1.688`.  The value of that expression is `-.478`.  That value is the argument to `abs`.  The absolute value of that is `.478`, so `.478` is the value of the full expression `abs(1.21 - 1.688)`.

Picture simplifying the expression in several steps:

1. `abs(1.21 - 1.688)`
2. `abs(-.478)`
3. `.478`

In fact, that's basically what Python does to compute the value of the expression.

**Question 3.3.1.** Say that Paola's height is 1.76 meters.  In the next cell, use `abs` to compute the absolute value of the difference between Paola's height and the average human height.  Give that value the name `paola_distance_from_average_m`.

<img src="numberline_1.png" width="1000" />

<!--
BEGIN QUESTION
name: q3_3_1
-->

In [99]:
# Replace the ... with an expression to compute the absolute value of the 
# difference between Paola's height (1.76m) and the average human height.
paola_distance_from_average_m = ...

# Again, we've written this here so that the distance you compute will get 
# printed when you run this cell.
paola_distance_from_average_m

In [None]:
grader.check("q3_3_1")

Now say that we want to compute the more unusual of the two heights.  We'll use the function `max`, which (again) returns the largest of the arguments passed to it.  Combining that with the `abs` function, we can compute the larger distance from average among the two heights:

In [101]:
# Just read and run this cell

kayla_height_m = 1.21
paola_height_m = 1.76
average_adult_height_m = 1.688

# The larger distance from the average human height, among the two heights:
larger_distance_m = max(abs(kayla_height_m - average_adult_height_m), abs(paola_height_m - average_adult_height_m))

# Print out our results in a nice readable format:
print("The larger distance from the average height among these two people is", larger_distance_m, "meters.")

The line where `larger_distance_m` is computed looks complicated, but we can break it down into simpler components just like we did before.

The basic recipe is to repeatedly simplify small parts of the expression:

* **Basic expressions:** Start with expressions whose values we know, like names or numbers.
    - Examples: `kayla_height_m` or `5`.
* **Find the next simplest group of expressions:** Look for basic expressions that are directly connected to each other. This can be by arithmetic or as arguments to a function call. 
    - Example: `kayla_height_m - average_adult_height_m`.
* **Evaluate that group:** Evaluate the arithmetic expression or function call. Use the value computed to replace the group of expressions.  
    - Example: `kayla_height_m - average_adult_height_m` becomes `-.478`.
* **Repeat:** Continue this process, using the value of the previously-evaluated expression as a new basic expression. Stop when we've evaluated the entire expression.
    - Example: `abs(-.478)` becomes `.478`, and `max(.478, .072)` becomes `.478`.
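
The recipe above can be sketched in code by giving each intermediate value a name. The names ending in `_difference_m` and `_distance_m`, along with `step_by_step_result` and `one_line_result`, are ours and appear only in this illustration.

```python
kayla_height_m = 1.21
paola_height_m = 1.76
average_adult_height_m = 1.688

# Evaluate the innermost groups first: the two subtractions.
kayla_difference_m = kayla_height_m - average_adult_height_m   # about -0.478
paola_difference_m = paola_height_m - average_adult_height_m   # about  0.072

# Each difference then becomes the single argument to abs.
kayla_distance_m = abs(kayla_difference_m)
paola_distance_m = abs(paola_difference_m)

# Finally, the two distances become the arguments to max.
step_by_step_result = max(kayla_distance_m, paola_distance_m)

# The nested one-line version goes through the same steps in the same order.
one_line_result = max(abs(kayla_height_m - average_adult_height_m),
                      abs(paola_height_m - average_adult_height_m))
print(step_by_step_result == one_line_result)
```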

You can run the next cell to see a slideshow of that process.

In [102]:
from IPython.display import IFrame
IFrame('https://docs.google.com/presentation/d/e/2PACX-1vTiIUOa9tP4pHPesrI8p2TCp8WCOJtTb3usOacQFPfkEfvQMmX-JYEW3OnBoTmQEJWAHdBP6Mvp053G/embed?start=false&loop=false&delayms=3000', 800, 600)

**Question 3.3.2.** Given the heights of players from the Golden State Warriors, write an expression that computes the smallest difference between any of the three heights. Your expression shouldn't have any numbers in it, only function calls and the names `klay`, `steph`, and `draymond`. 
Give the value of your expression the name `min_height_difference`.

<!--
BEGIN QUESTION
name: q3_3_2
-->

In [103]:
# The three players' heights, in meters:
klay =  2.01 # Klay Thompson is 6'7"
steph = 1.91 # Steph Curry is 6'3"
draymond = 1.98 # Draymond is 6'6"
             
# We'd like to look at all 3 pairs of heights, compute the absolute difference 
# between each pair, and then find the smallest of those 3 absolute differences.  
# This is left to you! If you're stuck, try computing the value for each step 
# of the process (like the difference between Klay's height and Steph's height) 
# on a separate line and giving it a name (like klay_steph_height_diff)
min_height_difference = ...
min_height_difference

In [None]:
grader.check("q3_3_2")

That brings us to the end of the "Python" portion of this problem set. 

The next two sections are much more in the mode of a lecture than an exercise. Follow along to get a taste of the kind of data manipulation and visualization we will be doing using three Python extensions: the numerical Python, Python data analysis, and mathematical plotting libraries `numpy`, `pandas`, and `matplotlib`.

## 4. Data Manipulation

We will be using `pandas` in this course for data manipulation and analysis. `pandas` stores data in the form of something called a `DataFrame`, which is really just another word for table.

### 4.1. Reading in a dataset

Most of the time, the data we want to analyze will be in a separate file, typically as a `.csv` file. 
In this case, we want to read the files in and convert them into a tabular format.

`pandas` has a function for reading csv files, `pd.read_csv(<file_path>)`, which takes the path to the file as its argument. A relative path such as `"baby.csv"` is looked up starting from the notebook's working directory. 

The `<dataframe>.head(...)` method displays the first 5 rows of the data frame by default. 
If you want to specify the number of rows displayed, you can use `<dataframe>.head(<num_rows>)`.
Similarly, if you want to see the last few rows of the data frame, you can use `<dataframe>.tail(<num_rows>)`. 
Try it out for yourself!

In [154]:
baby_df = pd.read_csv("baby.csv") # df is short for dataframe
baby_df.head(5)
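
If you'd like to experiment without a file on disk, here is a self-contained sketch. It assumes a small made-up table (`csv_text`, with invented column names) and reads it from memory through `io.StringIO`, which `pd.read_csv` accepts in place of a file path.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk; the data are made up.
csv_text = """city,population_thousands
A,120
B,85
C,300
D,42
E,510
F,77
G,19
"""

# pd.read_csv accepts a file path or, as here, a file-like object.
cities_df = pd.read_csv(io.StringIO(csv_text))

print(cities_df.head())     # first 5 rows by default
print(cities_df.head(2))    # first 2 rows
print(cities_df.tail(3))    # last 3 rows
```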

### 4.2. Series and arrays


One of the primary data types we've encountered so far is the `numpy` array.
In `pandas`, there is a very similar, but slightly different data type called a `Series`.

You can access the values of a particular column by using `<dataframe>['<column_name>']`. 
`<dataframe>['<column_name>']` will return a `Series` instead of an array. Notice that you have to write the column name in quotes. Single or double quotes both work fine, but here we use single quotes because it's faster to type.



In [155]:
baby_df['Birth.Weight']

In the above example, the first element in the `Birth.Weight` column is the integer 120. The corresponding index is 0.
A `Series` object is basically an array with indices for each data point (note that a series' indices can be anything and do not necessarily have to start with 0). 

A `pandas`  `DataFrame` can be thought of as a collection of Series, all of which have the same index. 
The resulting `DataFrame` consists of columns where each column is a `Series` and each row has a unique index.


In [156]:
baby_df.head()
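
To make the "collection of Series" picture concrete, the sketch below builds two small `Series` by hand and combines them. The values, the index labels (10, 20, 30), and the column names are all made up for illustration.

```python
import pandas as pd

# A Series pairs each value with an index label; the labels need not start at 0.
weights = pd.Series([120, 113, 128], index=[10, 20, 30])
ages = pd.Series([27, 33, 28], index=[10, 20, 30])

print(weights[20])          # look up a value by its index label
print(weights.values)       # the underlying values as a numpy array

# Series that share an index can be combined as the columns of a DataFrame.
small_df = pd.DataFrame({"weight": weights, "age": ages})
print(small_df)
print(type(small_df["weight"]).__name__)
```

Selecting a column from `small_df` gives back a `Series` with the same index labels.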

### 4.3. Getting the shape of a table

The number of rows and columns in a `DataFrame` can be accessed together using the  `.shape` attribute. 
Notice that the index is not counted as a column.

In [157]:
baby_df.shape

To get just the number of rows, we want the 0th element.

In [158]:
baby_df.shape[0]

For just the number of columns, we want the 1st element.

In [159]:
baby_df.shape[1]
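Because `.shape` is a tuple, both numbers can also be unpacked in one step. A short sketch on a made-up table:

```python
import pandas as pd

# Hypothetical table with 3 rows and 2 columns
toy_df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# Unpack the (rows, columns) tuple into two variables
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)
```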

### 4.4. Indices

The row labels of a `DataFrame` are collectively called the index. 
It helps to identify each row. 
By default, the index values are the row numbers, with the first row having index 0.

In [160]:
baby_df.head()

We can access the index of a `DataFrame` by calling `DataFrame.index`.

In [161]:
baby_df.index

That doesn't seem too helpful. 
We can access the values of the index using `.values`.

In [162]:
baby_df.index.values

In addition, we can set the index to whatever we want it to be. 
So, instead of the index going from 0 to 1173, we can change it to go from 1 to 1174. Note that `set_index` returns a new `DataFrame` and leaves `baby_df` itself unchanged.

In [163]:
baby_df.set_index(np.arange(1, 1175))
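The sketch below, on a made-up three-row table, illustrates that `set_index` hands back a new table rather than modifying the original. To keep the new index you would assign the result to a variable.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table
toy_df = pd.DataFrame({'x': [10, 20, 30]})

# set_index returns a new DataFrame; toy_df keeps its original index
reindexed = toy_df.set_index(np.arange(1, 4))

print(toy_df.index.values)     # index of the original table
print(reindexed.index.values)  # index of the new table
```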

### 4.5. Selecting columns

Sometimes the entire dataset contains too many columns, and we are only interested in some of the columns. 
In these situations, we would want to be able to select and display a subset of the columns from the original table. 

We can do this using the following syntax: `<dataframe>[<list of columns we want to select>]`

In [164]:
# Selects the columns "Birth.Weight" and "Maternal.Age" with all rows
baby_df[['Birth.Weight', 'Maternal.Age']]

Notice the syntax of selecting multiple columns, and how the following doesn't work. You need to specify the columns you want as a list *within* the outer brackets.

In [165]:
# This doesn't work: pandas looks for a single column named ('Birth.Weight', 'Maternal.Age') and raises a KeyError
# baby_df['Birth.Weight', 'Maternal.Age']
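Here is a minimal sketch of the difference between single and double brackets, using a made-up table with columns `a`, `b`, and `c`. The failing form is wrapped in `try`/`except` so the cell still runs.

```python
import pandas as pd

# Hypothetical table
toy_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

one_column = toy_df['a']         # one column name: the result is a Series
sub_table = toy_df[['a', 'c']]   # a list of names: the result is a DataFrame

try:
    toy_df['a', 'c']             # treated as one column named ('a', 'c')
except KeyError as err:
    print('KeyError:', err)
```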

### 4.6. Getting a value

What if you want to single out one entry of your entire table? 
This often comes up when we want the largest or smallest value after sorting the table, for example:
- *How many gestational days did the baby of the oldest mother have?*
- *How heavy was the baby with the longest gestation?*

To do this, we would first somehow sort the table, and then select the 0th element in the column of interest.
In the code below, we show how to get the 0th element in the `Birth.Weight` column. The `.values` part retrieves the values in the series as an array. From here, you've already learned how to index into an array (simply doing `[0]` for the 0th element).

In [208]:
baby_df['Birth.Weight'].values[0]
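As a sketch of the "sort, then take the 0th value" idea, the cell below answers the first question on a made-up table. The column names and numbers are invented, and it uses `sort_values`, which is covered in section 5.4.

```python
import pandas as pd

# Hypothetical table; values are invented for illustration
toy_df = pd.DataFrame({
    'mother_age': [27, 41, 33],
    'gestational_days': [284, 270, 291],
})

# Sort so that the oldest mother is in the first row,
# then take the 0th value of the column of interest
by_age = toy_df.sort_values('mother_age', ascending=False)
days_for_oldest = by_age['gestational_days'].values[0]
print(days_for_oldest)
```

Going through `.values` matters here: after sorting, the row labels are no longer in order, so position 0 and index label 0 refer to different rows.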

## 5. Techniques in `pandas`

### 5.1. Filtering and boolean indexing

Sometimes, we would like to filter a table by only returning rows that satisfy a specific condition.
We can do this in `pandas` by "boolean indexing". 
The expression below returns a boolean series where an entry is `True` if it satisfies the condition and `False` if it doesn't.

In [167]:
baby_df['Birth.Weight'] > 120

If we want to filter our `pandas` dataframe for all rows that satisfy `'Birth.Weight'` > 120, we can use the boolean series when indexing the table. 
The idea is that we only want the rows where the "boolean index" is `True`. 

In [168]:
# Select all rows that are True in the boolean series baby_df['Birth.Weight'] > 120
baby_df[baby_df['Birth.Weight'] > 120]


Here are a few more examples:

In [169]:
# Return all rows where Maternal.Height is greater than or equal to 63
baby_df[baby_df['Maternal.Height'] >= 63]

In [170]:
# Return all rows where Maternal.Smoker is True
baby_df[baby_df['Maternal.Smoker'] == True]

### 5.2. Filtering on multiple conditions

We can also filter on multiple conditions. 
If we want rows where every condition is true, we separate our criteria with the `&` symbol, where `&` represents *and*.

`df[(boolean series 1) & (boolean series 2) & (boolean series 3)]`

If we want just one of the conditions to be true, we separate our criteria with `|` symbols, where `|` represents *or*.

`df[(boolean series 1) | (boolean series 2) | (boolean series 3)]`

In [171]:
# Select all rows where Gestational.Days are above or equal to 270, but less than 280
baby_df[(baby_df['Gestational.Days'] >= 270) & (baby_df['Gestational.Days'] < 280)]
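The following sketch shows both `&` and `|` on a made-up four-row table (the column names `days` and `smoker` are placeholders). The parentheses around each comparison are needed because Python evaluates `&` and `|` before comparisons such as `>=`.

```python
import pandas as pd

# Hypothetical table
toy_df = pd.DataFrame({
    'days': [265, 272, 279, 284],
    'smoker': [True, False, True, False],
})

# Rows where both conditions hold
both = toy_df[(toy_df['days'] >= 270) & (toy_df['days'] < 280)]

# Rows where at least one condition holds
either = toy_df[(toy_df['days'] < 270) | (toy_df['smoker'])]

print(both)
print(either)
```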

Next, we will be working with the `cones` table, a very small example table about ice cream flavors.

In [172]:
cones_df = pd.read_csv('cones.csv')
cones_df

### 5.3. Inserting new columns

We have new information about our ice cream! 
Suppose your friend tells you what type of cone each ice cream comes with. 
We want to add this as a column to our data. 

In [173]:
type_of_cone = ["Waffle", "Sugar", "Sugar", "Waffle", "Waffle", "Sugar"]

There are many ways to add a new column to a dataframe in `pandas`; the easiest way is shown below. 
The text string inside the brackets is the new column's name, and the list assigned to it supplies the new column's values, one per row. 
 

In [174]:
cones_df['Type of Cone'] = type_of_cone
cones_df
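One thing to watch for is that the list must have exactly one entry per row. The sketch below uses a made-up two-row table to show a successful assignment and the error raised when the lengths do not match.

```python
import pandas as pd

# Hypothetical two-row table
toy_df = pd.DataFrame({'Flavor': ['vanilla', 'mint'], 'Price': [3.5, 4.0]})

# Works: the list has one entry per row
toy_df['Cone'] = ['Waffle', 'Sugar']

try:
    # Fails: one value for two rows
    toy_df['Bad'] = ['only one value']
except ValueError as err:
    print('ValueError:', err)

print(toy_df)
```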

### 5.4. Sorting columns

What if we want to know the flavor of the most expensive ice cream? In this case we would like to sort the table by the `Price` column.

[`df.sort_values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) is the function in `pandas` that sorts the dataframe by a specified column. 

The argument to specify sorting order with `df.sort_values` is `ascending = True/False`. By default, `pandas` sets `ascending=True`.

In [176]:
# Sorts cones_df in ascending order of 'Price'
cones_df.sort_values('Price')

In [177]:
# Sorts cones_df in descending order of 'Price'
cones_df.sort_values('Price', ascending=False)

We can also sort a dataframe by its index using [`sort_index`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html). It works much like `sort_values`, but does not require us to pass in a column since we are sorting by the index.

In [178]:
cones_df = cones_df.sort_index()
cones_df
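To see how the two sorting methods interact, here is a sketch on a made-up table. Row labels travel with their rows when sorting by value, so `sort_index` can be used afterwards to recover the original order.

```python
import pandas as pd

# Hypothetical table
toy_df = pd.DataFrame({'Flavor': ['a', 'b', 'c'], 'Price': [4.75, 3.55, 5.25]})

# Most expensive first; the index labels move with their rows
by_price = toy_df.sort_values('Price', ascending=False)
most_expensive = by_price['Flavor'].values[0]

# Sorting by the index puts the rows back in their original order
restored = by_price.sort_index()

print(by_price.index.values)
print(most_expensive)
```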

### 5.5. Groupby and aggregation

Suppose we wanted to find the average price for each ice cream flavor.

Grouping in `pandas` is done with the [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) method.
This process is slightly more involved because calling `groupby` returns a "groupby" object, which we can then use to aggregate a column. 

In [180]:
# A groupby object is returned
cones_df[['Flavor', 'Price']].groupby('Flavor')

We can then finish our grouping by using an aggregating function like `.mean`.

In [181]:
cones_df[['Flavor', 'Price']].groupby('Flavor').mean()

Note that the column we group by automatically becomes the index of the new dataframe. We can prevent this by setting `as_index=False` in our call to `groupby`.

In [182]:
cones_df[['Flavor', 'Price']].groupby('Flavor', as_index=False).mean()

Typical functions you can use to aggregate include `.mean`, `.median`, `.sum`, `.max`, and `.min`.
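As a sketch of swapping in different aggregating functions, the cell below groups a made-up table by `Flavor` and computes both the mean and the maximum price. The values are invented, not taken from `cones.csv`.

```python
import pandas as pd

# Hypothetical table with two flavors, two cones each
toy_df = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate', 'strawberry'],
    'Price': [3.0, 4.0, 6.0, 5.0],
})

# Mean price per flavor; the flavors become the index of the result
mean_price = toy_df.groupby('Flavor')['Price'].mean()

# Maximum price per flavor, keeping Flavor as an ordinary column
max_price = toy_df.groupby('Flavor', as_index=False).max()

print(mean_price)
print(max_price)
```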

### 5.6. An exercise in `pandas`

In this exercise, you will use some common functions in `pandas` that are featured above.

**Question 5.6.1.** Read in the table `gdp.csv`, storing it in the variable `gdp`.

<!--
BEGIN QUESTION
name: q5_6_1
-->

In [223]:
gdp = ...
gdp

In [None]:
grader.check("q5_6_1")

The three variables in `gdp` that we are interested in are the following:

1. `cn` $\Rightarrow$ Capital Stock in millions of USD
2. `cgdpe` $\Rightarrow$ Expenditure-side Real GDP in millions of USD
3. `emp` $\Rightarrow$ Number of Persons employed in millions

**Question 5.6.2.** 
Select the columns `country`, `year`, `cn`, `cgdpe`, `emp`, from the dataframe called `gdp`. Call the new table `gdp2` and display its first five rows. 

<!--
BEGIN QUESTION
name: q5_6_2
-->

In [225]:
gdp2 = ...
gdp2.head()

In [None]:
grader.check("q5_6_2")

**Question 5.6.3.** 
What is the most recent year in the data?

*Hint*: Use `df.sort_values(...)`.

<!--
BEGIN QUESTION
name: q5_6_3
-->

In [226]:
sorted_cleaned_gdp = ...
latest_year = ...
latest_year

In [None]:
grader.check("q5_6_3")

**Question 5.6.4.** 
Notice that there are a lot of -1 values. This dataset uses -1 to indicate missing data for a given country-year combination, so we don't really care about these rows.
Filter out all the rows in which the GDP, employment level, or capital stock was recorded as -1 and store the corresponding table to the variable `cleaned_gdp`.

<!--
BEGIN QUESTION
name: q5_6_4
-->

In [228]:
cleaned_gdp = ...
cleaned_gdp

In [None]:
grader.check("q5_6_4")

**Question 5.6.5.** 
Create a two-column table called `min_years` from `cleaned_gdp`. 
Its first column will be called `country` and the second `year`, referring to the earliest year. 
It should contain all of the countries that appear in `cleaned_gdp`, sorted in alphabetical order, with the earliest year they appear in the dataset where the `cgdpe`, `cn`, and `emp` columns are not -1.

<!--
BEGIN QUESTION
name: q5_6_5
-->

In [229]:
min_years = ...
min_years = ...
min_years

In [None]:
grader.check("q5_6_5")

**Question 5.6.6.** 
For the rest of this question, we will only investigate 2 countries: the United States and Singapore. 
Assign to `my_countries` the rows of `cleaned_gdp` that refer to either country.

<!--
BEGIN QUESTION
name: q5_6_6
-->

In [231]:
my_countries = ...
my_countries

In [None]:
grader.check("q5_6_6")

**Question 5.6.7.** 
Compute the GDP per employed-person and add that as a column called `gdp_pc` to `my_countries`.

*Hint*: You can divide a column by another column, which will do element-wise division. The same works for other operations, such as addition.

<!--
BEGIN QUESTION
name: q5_6_7
-->

In [235]:
...
my_countries

In [None]:
grader.check("q5_6_7")

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export("ps00.ipynb")