# Problem Set 0 - Python & Jupyter Notebooks

In this problem set, you will learn how to:

* navigate Jupyter notebooks (like this one).
* write and evaluate some basic *expressions* in Python, the computer language of the course.
* manipulate data using the `pandas` package.

For reference, you might find it useful to read [Chapter 3 of the Data 8 textbook](http://www.inferentialthinking.com/chapters/03/programming-in-python.html). [Chapter 1](https://www.inferentialthinking.com/chapters/01/what-is-data-science.html) and [Chapter 2](https://www.inferentialthinking.com/chapters/02/causality-and-experiments.html) are worth skimming as well.

## 1. Jupyter Notebooks

This webpage is called a Jupyter notebook. A notebook is an interactive, editable computer document in which you can write computer programs; view their results; and comment, annotate, and explain what is going on.

### 1.1. Text cells

In a notebook, each rectangle containing text or code is called a *cell*.

Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called  [markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings. 
You don't need to learn too much about markdown.

After you edit a text cell, click the "run cell" button that looks like `▶` in the toolbar at the top of this window, or hold down `shift` + press `return`, to confirm any changes to the text and formatting.

**Question 1.1.1.** This paragraph is in its own text cell.  Try editing it so that this sentence is the last sentence in the paragraph, and then click the "run cell" ▶| button or hold down `shift` + `return`.  This sentence, for example, should be deleted.  So should this one.

### 1.2. Code cells

Other cells contain code in the Python 3 language. Running a code cell will execute all of the code it contains.

To run the code in a code cell, first click on that cell to activate it.  It'll be highlighted with a little green or blue rectangle.  Next, either press `▶` or hold down `shift` + press `return`.

Try running this cell:

In [46]:
print("Hello, world!")

The fundamental building block of Python code is an expression. Cells can contain multiple lines with multiple expressions. When you run a cell, the lines of code are executed in the order in which they appear. Every `print` expression prints a line. Run the next cell and notice the order of the output.

In [47]:
print("First this line is printed,")
print("and then this one.")

### 1.3. Writing notebooks

You can use Jupyter notebooks for your own projects or documents.  When you make your own notebook, you'll need to create your own cells for text and code.

To add a cell, click the + button in the menu bar.  It'll start out as a text cell.  You can change it to a code cell by clicking inside it so it's highlighted, clicking the drop-down box next to the restart (⟳) button in the menu bar, and choosing "Code".

**Question 1.3.1.** Add a code cell below this one.  Write code in it that prints out:
   
    Econometrics is what econometricians do.

Run your cell to verify that it works.

### 1.4. The kernel

The kernel is a program that executes the code inside your notebook and outputs the results. In the top right of your window, you can see a circle that indicates the status of your kernel. If the circle is empty (⚪), the kernel is idle and ready to execute code. If the circle is filled in (⚫), the kernel is busy running some code. 

Next to every code cell, you'll see some text that says `In [...]`. Before you run the cell, you'll see `In [ ]`. When the cell is running, you'll see `In [*]`. If you see an asterisk (\*) next to a cell that doesn't go away, it's likely that the code inside the cell is taking too long to run, and it might be a good time to interrupt the kernel (discussed below). When a cell is finished running, you'll see a number inside the brackets, like so: `In [1]`. The number corresponds to the order in which you run the cells; so, the first cell you run will show a 1 when it's finished running, the second will show a 2, and so on. 

You may run into problems where your kernel is stuck for an excessive amount of time, your notebook is very slow and unresponsive, or your kernel loses its connection. If this happens, try the following steps:
1. At the top of your screen, click **Kernel**, then **Interrupt**.
2. If that doesn't help, click **Kernel**, then **Restart**. If you do this, you will have to run your code cells from the start of your notebook up until where you paused your work.
3. If that doesn't help, restart your server. First, save your work by clicking **File** at the top left of your screen, then **Save and Checkpoint**. Next, click **Control Panel** at the top right. Choose **Stop My Server** to shut it down, then **Start My Server** to start it back up. Then, navigate back to the notebook you were working on. You'll still have to run your code cells again.

**IMPORTANT:** If you leave your notebook alone for a while, the server will "clear" the code you've run and you'll have to run your notebook from the very top again. If you've been away from your computer and after coming back you get a `<something> not defined` error this is almost certainly what happened.

### 1.5. Libraries

There are many add-ons and extensions to the core of Python that are useful for getting work done. They are contained in what are called libraries. Let's tell Python to load some of them. Run the code cell below to do so.

The following lines import `numpy` and `pandas`. 
Note that we import `pandas` as `pd` and `numpy` as `np`. 
This means that, for example, when we call functions in `pandas`, we will always reference them with `pd` first. `pd` is like a shortcut.

In [91]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")
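
As a small illustration of how the shortcuts are used, here is a minimal sketch. The numbers and the names `root`, `heights`, and `average_height` are made up for this example and are not part of the problem set.

```python
import numpy as np
import pandas as pd

# Call a numpy function through the `np` shortcut
root = np.sqrt(16)

# Build a tiny pandas column through the `pd` shortcut (values are made up)
heights = pd.Series([60, 62, 65])
average_height = heights.mean()

print(root)
print(average_height)
```

Without the `as np` and `as pd` parts, you would have to type `numpy.sqrt(...)` and `pandas.Series(...)` in full each time.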

## 2. Python: Numbers & Variables

In addition to representing commands to print out lines, expressions can represent numbers and methods of combining numbers. The expression `3.2500` evaluates to the number 3.25. (Run the cell and see.)

In [49]:
3.2500

Notice that we didn't have to `print`. When you run a notebook cell, if the last line has a value, then Jupyter helpfully prints out that value for you. However, it won't print out prior lines automatically.

In [50]:
print(2)
3
4

Above, you should see that 4 is displayed because it is the value of the last expression and 2 appears because it was printed, but 3 does not appear because it was neither printed nor last.

### 2.1. Arithmetic

The line in the next cell subtracts.  Its value is what you'd expect.  Run it.

In [51]:
3.25 - 1.5

Many basic arithmetic operations are built into Python.  The Data 8 textbook section on [Expressions](http://www.inferentialthinking.com/chapters/03/1/expressions.html) describes all the arithmetic operators used in the course. 
The common operator that differs from typical math notation is `**`, which raises one number to the power of the other. So, `2**3` stands for $2^3$ and evaluates to 8. 

The order of operations is the same as what you learned in elementary school, and Python also has parentheses.  For example, compare the outputs of the cells below. The second cell uses parentheses for a happy new year!

In [52]:
5+6*5-6*3**2*2**3/4*7

In [53]:
5+(6*5-(6*3))**2*((2**3)/4*7)

In standard math notation, the first expression is

$$5 + 6 \times 5 - 6 \times 3^2 \times \frac{2^3}{4} \times 7,$$

while the second expression is

$$5 + (6 \times 5 - (6 \times 3))^2 \times (\frac{(2^3)}{4} \times 7).$$
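
To see where the two results come from, the sketch below evaluates both expressions piece by piece. The intermediate names (`power_a`, `product`, `inner`, and so on) are made up for illustration; only the arithmetic comes from the cells above.

```python
# First expression: exponents first, then * and / from left to right,
# then + and - from left to right.
power_a = 3 ** 2                          # 9
power_b = 2 ** 3                          # 8
product = 6 * power_a * power_b / 4 * 7   # ((6 * 9 * 8) / 4) * 7 = 756.0
first = 5 + 6 * 5 - product               # 35 - 756.0 = -721.0

# Second expression: parentheses are evaluated from the inside out.
inner = 6 * 5 - (6 * 3)                   # 30 - 18 = 12
factor = (2 ** 3) / 4 * 7                 # 8 / 4 * 7 = 14.0
second = 5 + inner ** 2 * factor          # 5 + 144 * 14.0 = 2021.0

print(first, second)
```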

**Question 2.1.1.** Write a Python expression in this next cell that's equal to $5 \times (3 \frac{10}{11}) - 50 \frac{1}{3} + 2^{.5 \times 22} - \frac{7}{33} + 4$.

By "$3 \frac{10}{11}$" we mean $3+\frac{10}{11}$, not $3 \times \frac{10}{11}$.

Replace the ellipses (`...`) with your expression.  Try to use parentheses only when necessary.

*Hint:* The correct output should start with a familiar number.

In [54]:
...

### 2.2. Variables

In natural language, we have terminology that lets us quickly reference complicated concepts.

In Python, we do this with *assignment statements*. An assignment statement has a name on the left side of an `=` sign and an expression to be evaluated on the right.

In [55]:
ten = 3 * 2 + 4

When you run that cell, Python first computes the value of the expression on the right-hand side, `3 * 2 + 4`, which is the number 10.  Then it assigns that value to the name `ten`.  At that point, the code in the cell is done running.

After you run that cell, the value 10 is bound to the name `ten`:

In [56]:
ten

The statement `ten = 3 * 2 + 4` is not asserting that `ten` is already equal to `3 * 2 + 4`, as we might expect by analogy with math notation.  Rather, that line of code changes what `ten` means; it now refers to the value 10, whereas before it meant nothing at all.
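
A short sketch can make the difference from a math equation concrete. The name `total` is made up for this example. As an equation, `total = total + 1` could never be true, but as an assignment it is perfectly fine:

```python
total = 3 * 2 + 4     # the right-hand side is evaluated first, then bound to `total`
before = total        # 10

total = total + 1     # compute 10 + 1, then rebind `total` to the result
after = total         # 11

print(before, after)
```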

**IMPORTANT:** The order in which you run the cells matters. Even though the cell that says `ten` appears after the cell that says `ten = 3 * 2 + 4` in this notebook, you still have to run the cell that says `ten = 3 * 2 + 4` before you run the cell that says `ten`. If you start up this notebook and the first cell you run is the `ten` cell, you'll get an error that says something like `name 'ten' is not defined`. That's because you haven't yet run the cell that defines the variable `ten`.

**Question 2.2.1.** Try writing code that uses a name (like `eleven`) that hasn't been assigned to anything.  You'll see an error!

In [57]:
...

A common pattern in Jupyter notebooks is to assign a value to a name and then immediately evaluate the name in the last line in the cell so that the value is displayed as output. 

In [58]:
close_to_pi = 355/113
close_to_pi

Another common pattern is that a series of lines in a single cell will build up a complex computation in stages, naming the intermediate results.

In [59]:
semimonthly_salary = 841.25
monthly_salary = 2 * semimonthly_salary
number_of_months_in_a_year = 12
yearly_salary = number_of_months_in_a_year * monthly_salary
yearly_salary

Names in Python can have letters (upper- and lower-case letters are both okay and count as different letters), underscores, and numbers.  The first character can't be a number (otherwise a name might look like a number).  And names can't contain spaces, since spaces are used to separate pieces of code from each other.

Other than those rules, what you name something doesn't matter *to Python*.  For example, this cell does the same thing as the above cell, except everything has a different name:

In [60]:
a = 841.25
b = 2 * a
c = 12
d = c * b
d

**However**, names are very important for making your code *readable* to yourself and others.  The cell above is shorter, but it's totally useless without an explanation of what it does.
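
The naming rules described earlier can be tried out with a small sketch. The names here (`salary`, `Salary`, `candidates`, `validity`) are made up for illustration. `str.isidentifier()` is a built-in string method that reports whether a piece of text has the form of a valid Python name.

```python
salary = 100
Salary = 200   # a different name: upper- and lower-case letters are distinct

candidates = ["yearly_salary", "salary2", "2nd_salary", "yearly salary"]

# Check each candidate against Python's naming rules
validity = {name: name.isidentifier() for name in candidates}

print(salary, Salary)
print(validity)
```

The last two candidates fail: one starts with a number, and the other contains a space.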

### 2.3. Checking your code 

Our notebooks include built-in *tests* to check whether your work is correct. Sometimes, there are multiple tests for a single question, and passing all of them is required to receive credit for the question. Please don't change the contents of the test cells.

Run the next code cell to initialize the tests:

In [61]:
# Using the pound symbol tells Python to ignore the rest of this line.
# This way we can make comments in code.
# Run this code cell to initialize the autograder
import otter
grader = otter.Notebook()

Go ahead and attempt Question 2.3.1. Running the cell directly after it will test whether you have assigned `seconds_in_a_decade` correctly. 
If you haven't, this test will tell you the correct answer. Resist the urge to just copy it, and instead try to adjust your expression.

**Question 2.3.1.** Assign the name `seconds_in_a_decade` to the number of seconds between midnight January 1, 2010 and midnight January 1, 2020. Note that there are two leap years in this span of a decade. A non-leap year has 365 days and a leap year has 366 days.

<!--
BEGIN QUESTION
name: q2_3_1
-->

In [62]:
# Using the pound symbol tells Python to ignore the rest of this line.
# This way we can make comments in code.
# Change the next line so that it computes the number of seconds in a decade 
# and assigns that number the name, seconds_in_a_decade.

seconds_in_a_decade = ...

# We've put this line in this cell so that it will print the value you've given 
# to seconds_in_a_decade when you run it. You don't need to change this.

seconds_in_a_decade

In [None]:
grader.check("q2_3_1")

If the autograder found that you had set the right variable(s) to the proper value(s) that it expected, well and good: you are probably not far off track. If you failed any of the tests, go back and try again.

## 3. Data

We will be using `pandas` in this course for data manipulation and analysis. `pandas` stores data in the form of something called a `DataFrame`, which is really just another word for table.

Recall that we typically use `pd` as the shortcut for `pandas`.

### 3.1. Reading in a dataset

Most of the time, the data we want to analyze will be in a separate file, typically as a `.csv` file. 
In this case, we want to read the files in and convert them into a tabular format.

`pandas` has a specific function to read in CSV files called `pd.read_csv(<file_path>)`, which takes the path to the file (relative to this notebook) as its argument. Your dataset will take the form of a variable, as shown below where we call our dataset `baby_df`.

The `<dataframe>.head(...)` function will display the first 5 rows of the data frame by default. 
If you want to specify the number of rows displayed, you can use `<dataframe>.head(<num_rows>)`.
Similarly, if you want to see the last few rows of the data frame, you can use `<dataframe>.tail(<num_rows>)`.

In [64]:
# We will call our dataset baby_df
baby_df = pd.read_csv("baby.csv") # df is short for dataframe
baby_df.head(5)
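
If you'd like to experiment with `read_csv`, `head`, and `tail` without touching `baby.csv`, here is a minimal sketch. The CSV text, its column names, and the name `toy_df` are all made up for illustration. `io.StringIO` simply lets a piece of text stand in for a file.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk (this is not baby.csv)
csv_text = """name,age,height
Ana,31,64
Ben,27,70
Cy,45,66
Dee,38,61
Eli,29,68
Fay,52,63
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO
toy_df = pd.read_csv(io.StringIO(csv_text))

first_two = toy_df.head(2)     # first 2 rows
last_three = toy_df.tail(3)    # last 3 rows
default_head = toy_df.head()   # first 5 rows by default

print(first_two)
print(last_three)
```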

### 3.2. Columns

You can access the values of a particular column by using `<dataframe>['<column_name>']`. 
Notice that you have to write the column name in quotes. Single or double quotes both work fine, but here we use single quotes because it's faster to type.



In [65]:
baby_df['Birth.Weight']

### 3.3. Getting the shape of a table

The number of rows and columns in a `DataFrame` can be accessed together using the  `.shape` attribute. 
Notice that the index is not counted as a column.

In [66]:
baby_df.shape
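
Since `.shape` is a pair of numbers, you can also split it into two names. The sketch below uses a made-up table (`toy_df`) rather than the course data.

```python
import pandas as pd

# A made-up table with 3 rows and 2 columns
toy_df = pd.DataFrame({
    "age": [31, 27, 45],
    "height": [64, 70, 66],
})

# .shape is a (rows, columns) pair; the index is not counted as a column
n_rows, n_cols = toy_df.shape

print(toy_df.shape)
print(n_rows, n_cols)
```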

### 3.4. Selecting columns

Sometimes the entire dataset contains too many columns, and we are only interested in some of the columns. 
In these situations, we would want to be able to select and display a subset of the columns from the original table. 

We can do this using the following syntax: `<dataframe>[[<list of columns we want to select>]]`.

Note that there must be two sets of square brackets.

In [67]:
# Selects the columns "Birth.Weight" and "Maternal.Age" with all rows
baby_df[['Birth.Weight', 'Maternal.Age']]

We haven't actually changed anything about the dataset by doing this.

In [68]:
baby_df.head()

To save the dataset with fewer columns, define a new variable (or re-define the same dataset variable).

In [69]:
smaller_baby_df = baby_df[['Birth.Weight', 'Maternal.Age']]
smaller_baby_df.head()
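
To see why the two sets of square brackets matter, compare the two selections in this sketch. The table `toy_df` and its columns are made up for illustration.

```python
import pandas as pd

# A made-up table with three columns
toy_df = pd.DataFrame({
    "age": [31, 27, 45],
    "height": [64, 70, 66],
    "smoker": [False, True, False],
})

one_column = toy_df["age"]              # single brackets: one column (a Series)
sub_table = toy_df[["age", "height"]]   # double brackets: a smaller table (a DataFrame)

print(type(one_column).__name__)
print(type(sub_table).__name__)
print(list(toy_df.columns))             # the original table still has all three columns
```

The inner pair of brackets is an ordinary Python list of column names, and the outer pair does the selecting.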

## 4. Techniques in `pandas`

### 4.1. Filtering and boolean indexing

Sometimes, we would like to filter a table by only returning rows that satisfy a specific condition.
We can do this in `pandas` by "boolean indexing". 
The expression below returns a boolean column where an entry is `True` if the row satisfies the condition and `False` if it doesn't.

In [70]:
baby_df['Birth.Weight'] > 120

If we want to filter our data for all rows that satisfy `'Birth.Weight'` > 120, we can use the above expression as if we were selecting a column, as below. 
The idea is that we only want the rows where the "boolean column" is `True`. 

In [71]:
# Select all rows that are True in the boolean series baby_df['Birth.Weight'] > 120
baby_df[baby_df['Birth.Weight'] > 120]

You can tell we're missing some rows by the fact that the index on the very left is missing some numbers.

Here are a few more examples:

In [72]:
# Return all rows where Maternal.Height is greater than or equal to 63
baby_df[baby_df['Maternal.Height'] >= 63]

Notice below that to ask whether something is equal to something else we use `==`. This is different from defining a variable, which uses `=`.

In [73]:
# Return all rows where Maternal.Smoker is True
baby_df[baby_df['Maternal.Smoker'] == True]
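
The sketch below walks through boolean indexing in two steps on a made-up table, so you can see the intermediate "boolean column". The names `toy_df`, `mask`, `heavy`, and `smokers` are invented for this example.

```python
import pandas as pd

# A made-up table (not the course data)
toy_df = pd.DataFrame({
    "weight": [110, 125, 130, 118],
    "smoker": [True, False, True, False],
})

mask = toy_df["weight"] > 120   # one True/False entry per row
heavy = toy_df[mask]            # keeps only the rows where mask is True

# A column that already holds True/False values can serve as the mask itself
smokers = toy_df[toy_df["smoker"]]

print(mask.tolist())
print(heavy.index.tolist())     # original row labels are kept, so gaps appear
```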

### 4.2. Filtering on multiple conditions

We can also filter on multiple conditions. 
If we want rows where every condition is true, we separate our criteria with the `&` symbol, where `&` represents *and*.

`df[(boolean condition 1) & (boolean condition 2) & (boolean condition 3)]`

If we just want at least one of the conditions to be true, we separate our criteria with `|` symbols, where `|` represents *or*.

`df[(boolean condition 1) | (boolean condition 2) | (boolean condition 3)]`

In [74]:
# Select all rows where Gestational.Days are above or equal to 270, but less than 280
baby_df[(baby_df['Gestational.Days'] >= 270) & (baby_df['Gestational.Days'] < 280)]

Again though, none of this changed the original dataset.

In [75]:
baby_df.head()

To save a new dataset with your desired filtering, you should define new variables.

In [76]:
gestational_range_baby_df = baby_df[(baby_df['Gestational.Days'] >= 270) & (baby_df['Gestational.Days'] < 280)]
gestational_range_baby_df.head()
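
Here is a minimal sketch that puts `&` and `|` side by side on a made-up table. The names `toy_df`, `in_range`, and `out_of_range` are invented for illustration.

```python
import pandas as pd

# A made-up table (not the course data)
toy_df = pd.DataFrame({"days": [265, 270, 279, 280, 290]})

# AND: both conditions must hold for a row to be kept
in_range = toy_df[(toy_df["days"] >= 270) & (toy_df["days"] < 280)]

# OR: a row is kept if at least one condition holds
out_of_range = toy_df[(toy_df["days"] < 270) | (toy_df["days"] >= 280)]

print(in_range["days"].tolist())
print(out_of_range["days"].tolist())
```

The parentheses around each condition are needed because Python applies `&` and `|` before comparisons like `>=` and `<`.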

### 4.3. Inserting new columns

Suppose we know that the device used to measure mother height was not calibrated correctly and underestimated each mother's height by 2 units. To correct this, we want to add 2 to each observation of mother height. The code below shows how to do this. The syntax is similar to defining a new variable. Notice we can do math with columns as if they were numbers, and the math is applied to each element in the column.

In [77]:
baby_df['Adjusted Height'] = baby_df['Maternal.Height'] + 2
baby_df.head()
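
The same idea works with two columns at once. The sketch below uses a made-up table, and the column names (`height`, `weight`, `adjusted_height`, `ratio`) are invented for illustration.

```python
import pandas as pd

# A made-up table (not the course data)
toy_df = pd.DataFrame({
    "height": [60, 64, 67],
    "weight": [120, 96, 134],
})

# Column plus a number: 2 is added to every row
toy_df["adjusted_height"] = toy_df["height"] + 2

# Column divided by column: the division happens row by row
toy_df["ratio"] = toy_df["weight"] / toy_df["height"]

print(toy_df)
```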

### 4.4. An exercise in `pandas`

In this exercise, you will use some common functions in `pandas` that are featured above.

**Question 4.4.1.** Read in the table `gdp.csv`, storing it in the variable `gdp`.

<!--
BEGIN QUESTION
name: q4_4_1
-->

In [78]:
gdp = ...
gdp

In [None]:
grader.check("q4_4_1")

The three variables in `gdp` that we are interested in are the following:

1. `cn` $\Rightarrow$ Capital Stock in millions of USD
2. `cgdpe` $\Rightarrow$ Expenditure-side Real GDP in millions of USD
3. `emp` $\Rightarrow$ Number of Persons employed in millions

**Question 4.4.2.** 
Select the columns `country`, `year`, `cn`, `cgdpe`, `emp`, from the dataframe called `gdp`. Call the new table `gdp2` and display its first five rows. 

<!--
BEGIN QUESTION
name: q4_4_2
-->

In [80]:
gdp2 = ...
gdp2.head()

In [None]:
grader.check("q4_4_2")

**Question 4.4.3.** 
Notice that there are a lot of -1 values. This dataset uses -1 to indicate missing data for a given country-year combination, so we don't really care about these rows.
Filter out all the rows in which the GDP, employment level, or capital stock was recorded as -1 and store the corresponding table to the variable `cleaned_gdp`. Use the `gdp2` table you defined above.

<!--
BEGIN QUESTION
name: q4_4_3
-->

In [82]:
cleaned_gdp = ...
cleaned_gdp

In [None]:
grader.check("q4_4_3")

**Question 4.4.4.** 
Compute the GDP per employed person and add that as a column called `gdp_pc` to `cleaned_gdp`.

*Hint*: Remember you can divide a column by another column, and this will do element-wise division.

<!--
BEGIN QUESTION
name: q4_4_4
-->

In [92]:
...
cleaned_gdp

In [None]:
grader.check("q4_4_4")

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()