In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab3.ipynb")

<img src="img/dsci511_header.png" width="600">

# Lab 3: Introduction to NumPy and Pandas

## Instructions

rubric={mechanics:4}

Follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).

In [None]:
import time
import numpy as np
import pandas as pd
import math
from random import gauss
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ['svg']
plt.rcParams.update({'font.size': 14, 'axes.labelweight': 'bold'})

## Exercise 1: NumPy warm-up

### 1.1
rubric={autograde:1}

Create the following array and save it as a variable `ex1_1`:

```python
array([[1, 2],
       [3, 4],
       [5, 6]])
```

In [None]:
ex1_1 = ...

In [None]:
grader.check("q1_1")

### 1.2
rubric={autograde:1}

Reshape the following `arr` into shape `(6, 1)` and save the result as `ex1_2`.

In [None]:
arr = np.array([[1, 2], [3, 4], [5, 6]])
ex1_2 = ...

In [None]:
grader.check("q1_2")

### 1.3
rubric={autograde:1}

Create a `(6, 6)` array by multiplying the following `arr` by itself (**hint**: leverage broadcasting here) and save the result as `ex1_3`. The first element (top left corner) should be 1, and the last element (bottom right corner) should be 36.
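
As a reminder of how broadcasting combines arrays of different shapes, here is a minimal sketch that adds a column vector to a row vector. The names `col`, `row`, and `grid` are illustrative only and are not part of the lab.

```python
import numpy as np

col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([[1, 2, 3]])         # shape (1, 3)

# NumPy stretches both operands to the common shape (3, 3) before adding
grid = col + row
print(grid.shape)
print(grid)
```

Any element-wise operator follows the same shape rules, so the idea carries over to other arithmetic.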

In [None]:
arr = np.arange(1, 7)[np.newaxis, :]
ex1_3 = ...

In [None]:
grader.check("q1_3")

### 1.4
rubric={autograde:1}

"Center" the following `arr` by subtracting the mean from every element, then flatten it to shape `(20,)` and sort it in descending order. Save the result as `ex1_4`.
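
The three steps can be sketched on a much smaller array. The array `toy` below is a made-up example, and reversing the output of `np.sort()` is just one possible way to get a descending order.

```python
import numpy as np

toy = np.array([[4.0, 1.0], [3.0, 8.0]])   # hypothetical example array

centered = toy - toy.mean()        # the mean (4.0) is subtracted from every element
flat = centered.flatten()          # shape (4,)
descending = np.sort(flat)[::-1]   # np.sort is ascending, so reverse the result
print(descending)
```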

In [None]:
arr = np.arange(1, 21).reshape(4, 5)
ex1_4 = ...

In [None]:
grader.check("q1_4")

## Exercise 2: Vectorizing loops

### 2.1
rubric={accuracy:2,quality:2}

The following mathematical function simulates a stock price using geometric Brownian motion. In this model, price in the next timestep $S_{t+1}$ depends on the current price $S_{t}$ as follows:

$$S_{t+1} = S_{t} \exp \left( -0.5\sigma^2 + \sigma Z \right)$$

where $Z$ is a random number drawn from a standard [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution), and $\sigma$ is the _volatility_ of the stock (i.e., if $\sigma=0$ then the stock price never changes, whereas if $\sigma$ is large the stock price can change wildly).

Below, the Python function `simulate(S0, sigma, T)` computes simulated stock prices according to the formula above and stores the price at each timestep in a Python list, which it returns:

In [None]:
def simulate(S0=1.0, sigma=0.1, T=1000):
    """
    Simulates a stock price using geometric Brownian motion, 
    and returns a list containing the results.

    Parameters
    ----------
    S0 : float 
        the initial stock price
    sigma : float 
        the volatility, should be non-negative
    T : int
        the total number of time steps

    Returns
    -------
    list
        The list containing stock prices at each time step 

    Examples
    --------
    >>> simulate(3.9, 0.5, 4)
    [3.9, 3.16886711, 1.06006414, 0.79470699]
    """
    
    S = [S0]
    # append to list for each time step
    for t in range(1, T):
        Z = gauss(0, 1)
        S.append(S[-1] * math.exp(-0.5 * sigma ** 2 + sigma * Z))

    return S

Your task here is to re-write this function as a new function called `simulate_np(S0, sigma, T)` using NumPy such that **you don't explicitly use for loops**.

Let's do some math first to simplify things a bit. Let's denote $\theta_t = \exp \left( -0.5\sigma^2 + \sigma Z_t \right)$, where $Z_t$ is the random number drawn at timestep $t$. Then:

$$S_{1} = S_{0} \theta_0$$

$$S_{2} = S_{1} \theta_{1} = S_{0} \theta_0 \theta_{1}$$

$$S_{3} = S_{2} \theta_{2} = S_{0} \theta_0 \theta_{1} \theta_{2}$$

$$\dots$$

$$S_{T} = S_{0} \prod_{t=0}^{T-1} \theta_{t}$$

- $\theta_t$ does not depend on the stock price at time `t`, so you can create $\prod_{t=0}^{T-1} \theta_{t}$ without using a loop.

- You can create an array of `Z`s with `np.random.randn()`.

- You can calculate the cumulative product of an array using the `cumprod()` method.

- Your function should return an `ndarray` object instead of a `list`.

- Don't forget to adapt the docstring of `simulate()` to your new function `simulate_np()`.

- You are given some plotting code to test out your function.
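
To see why a cumulative product can stand in for the loop, here is a small sketch with fixed (non-random) multipliers. The names `factors`, `start`, `loop_path`, and `vec_path` are hypothetical, and prepending `1.0` with `np.concatenate()` is only one way to keep the starting value in the output.

```python
import numpy as np

factors = np.array([2.0, 0.5, 3.0])   # hypothetical per-step multipliers
start = 10.0

# Loop version: multiply the running value by each factor in turn
loop_path = [start]
for f in factors:
    loop_path.append(loop_path[-1] * f)

# Vectorized version: the cumulative product gives every running product at once
vec_path = start * np.concatenate(([1.0], np.cumprod(factors)))

print(loop_path)
print(vec_path)
```

Both versions produce the same sequence of values; only the second avoids an explicit Python loop.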

<!-- BEGIN QUESTION -->



In [None]:
def simulate_np():
    ...
...

In [None]:
# Code you can use for plotting your function
price = simulate_np(S0=1, sigma=0.1, T=1000)
plt.plot(price)
plt.xlabel("Time")
plt.ylabel("Stock price")

<!-- END QUESTION -->

### 2.2
rubric={accuracy:2}

The goal of this question is to help drive home the power of vectorized array operations in NumPy vs iteration in base Python.

Write code to record how long it takes for `simulate()` and your new-and-improved `simulate_np()` to run 10,000 times (using their default values of `S0=1, sigma=0.1, T=1000`).

You can use the `time` module to help calculate the time it takes (in [real time](https://communities.sas.com/t5/SAS-Programming/Real-Time-vs-CPU-time/td-p/287743#:~:text=Real%20Time%20is%20the%20actual,the%20step%20utilises%20CPU%20resources.)) for your code to execute.

**Hints:**

- The `time` module has already been imported for you at the start of the lab.

- `time.time()` gives you the current clock time (so you can save a variable `start = time.time()` before your code and `end = time.time()` after your code and compare the difference).
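
The timing pattern described in the hints might look like the sketch below. The function `busy_work()` is a hypothetical stand-in for whatever you want to time, and 100 repetitions is an arbitrary choice.

```python
import time


def busy_work(n=1000):
    """Hypothetical stand-in for the function being timed."""
    return sum(i * i for i in range(n))


start = time.time()
for _ in range(100):   # repeat the call so the duration is measurable
    busy_work()
end = time.time()

elapsed = end - start
print(f"100 runs took {elapsed:.4f} seconds")
```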

<!-- BEGIN QUESTION -->



In [None]:
...

<!-- END QUESTION -->

## Exercise 3: NumPy wrangling

### 3.1
rubric={accuracy:1}

Below is a 400 x 400 pixel image of the UBC logo, imported and displayed with `matplotlib` (one of the most popular Python plotting packages).

In [None]:
image = plt.imread('img/ubc.jpeg')[:, :, 0]
plt.imshow(image, cmap='gray')

print(f"Image shape: {image.shape}")
print(f"Max. pixel value: {image.max()}")
print(f"Min. pixel value: {image.min()}")

**Note:** The `cmap='gray'` argument in `plt.imshow()` forces `matplotlib` to interpret the array as grayscale values (instead of showing an irrelevant colormap by default).

As you can see, `image` is just a NumPy array of size `(400, 400)`, with values ranging from 0 to 255.

Your task is to write code that **flips the image about both the horizontal and vertical axes**, so that it looks as if we rotated the image 180 degrees:

![](img/ubc_flipped.jpeg)

- Do not overwrite the `image` variable; we will use it in the next few questions.

- NumPy has helpful functions for "flipping" arrays. You can try typing `np.flip` to see what the auto-complete reveals.

- **Don't forget to show your resulting image in the output using `plt.imshow()`.**
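
As a quick illustration of what `np.flip()` does, the sketch below applies it to a tiny made-up array (`toy`) rather than the image.

```python
import numpy as np

toy = np.array([[1, 2, 3],
                [4, 5, 6]])

flipped_rows = np.flip(toy, axis=0)   # reverse the order of the rows
flipped_cols = np.flip(toy, axis=1)   # reverse the order of the columns
flipped_both = np.flip(toy)           # with no axis given, every axis is flipped

print(flipped_both)
```

`np.flip()` returns a new array object, so the name `toy` still refers to the original, unflipped data.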

<!-- BEGIN QUESTION -->



In [None]:
...

<!-- END QUESTION -->

### 3.2
rubric={accuracy:2}

Suppose that you want to prepare this image to post on Instagram, and you think a 20-pixels-wide black border will really make it stand out. Add a 20-pixels-wide black border to the image so that it looks like this:

![](img/ubc_border.jpeg)

**Hints:**

- The image including the border should remain 400 x 400 pixels (i.e., you are adding the border within the image). **Do not** add it to the outside and make the image 440 x 440 pixels.

- Black pixels have a value of `0` (i.e., they have 0 brightness).

- There are many ways to solve this question so do it however you like, but `np.pad()` might be super helpful to make your solution shorter.

- **Don't forget to show your resulting image in the output using `plt.imshow()`.**
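
To get a feel for `np.pad()`, here is a sketch on a tiny made-up array. The 6 x 6 array `toy` and the 1-pixel `border` are assumptions for illustration; cropping first and then padding is only one of the many possible approaches.

```python
import numpy as np

toy = np.full((6, 6), 255)   # hypothetical all-white 6 x 6 "image"
border = 1

# Drop the outer ring of pixels, leaving a (4, 4) array
inner = toy[border:-border, border:-border]

# Pad it back to (6, 6) with zeros (black) on every side
framed = np.pad(inner, pad_width=border, mode="constant", constant_values=0)
print(framed)
```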

<!-- BEGIN QUESTION -->



In [None]:
...

<!-- END QUESTION -->

### 3.3 (OPTIONAL)
rubric={accuracy:1% of total grade}

In DSCI 572 you'll learn about [Convolutional Neural Networks](https://en.wikipedia.org/wiki/Convolutional_neural_network) (CNN), which are often used when working with image data. A common operation in a CNN architecture is [pooling](https://cs231n.github.io/convolutional-networks/#pool), in which you reduce the size of an image by looking at a small window of pixels, say a 4 x 4 window of pixels, and representing that window using e.g., the max/min/mean value of the pixels in the window.

Below is an example of mean pooling, transforming a 6 x 6 image into a 3 x 3 image by taking the mean of 2 x 2 pixel windows:

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/average-pooling-a.png?58f9ab6d61248c3ec8d526ef65763d2f" width="400">

Source: [stanford.edu](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks)

Let's implement pooling in NumPy here to get a feel of how it works. We can do it by reshaping our image into `n x n` windows and then calculating a summary statistic (e.g., `.max()` or `.mean()`) of each window. Your task here is to implement **mean** pooling on the UBC logo image using a `10 x 10` window, resulting in an image that will look like this:

![](img/ubc_mean_pool.jpeg)

**Hints:**

- There are plenty of ways you could solve this question. One way is to start by reshaping each axis of the image into shape `(40, 10)` to end up with a 4D array of shape `(40, 10, 40, 10)`.

- Then apply `.mean()` to the reshaped data on both axes of size 10. You can specify multiple axes to operate on simultaneously by passing a tuple to the `axis=` parameter of the `mean()` method.

- Play around with your code to get a feel for what pooling does. What happens if you increase/decrease the pooling window size? Also feel free to try using `min()` or `max()` methods, for example:

![](img/ubc_min_pool.jpeg)

- See [here](https://stackoverflow.com/a/42463514) for more help.

- **Don't forget to show your resulting image in the output using `plt.imshow()`**
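
The reshape-then-reduce idea from the hints can be checked by hand on a small array. The sketch below uses a made-up 4 x 4 array (`toy`) with 2 x 2 windows, so the numbers are easy to verify.

```python
import numpy as np

toy = np.arange(16).reshape(4, 4)   # hypothetical 4 x 4 "image"
window = 2

# Split each axis into (number of windows, window size): shape (2, 2, 2, 2)
blocks = toy.reshape(2, window, 2, window)

# Average over the two window-sized axes, leaving one value per window
pooled = blocks.mean(axis=(1, 3))
print(pooled)
```

For instance, the top-left window holds the values 0, 1, 4, and 5, whose mean is 2.5.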

<!-- BEGIN QUESTION -->



In [None]:
...

<!-- END QUESTION -->

## Exercise 4: Reading data with Pandas 

rubric={autograde:5}

Reading data from external sources will be one of the most common things you do in Pandas, so it's good to get some practice doing this. We will use Pandas to read in variants of the Palmer penguins dataset that you've already seen in DSCI 523. In this data set, there are measures on the three species of penguins shown in this illustration below. You can learn more about the penguins and this dataset [here](https://allisonhorst.github.io/palmerpenguins/).

<img src="img/penguins.png" width="500">

Source: [allisonhorst.github.io](https://github.com/allisonhorst/palmerpenguins/blob/master/man/figures/lter_penguins.png?raw=true)

- In the `data` directory of this repository, there are 5 different versions of the same dataset in different file formats. Your task is to read each of these into a Pandas dataframe, **such that all dataframes look exactly like the one created by reading `penguins.csv`**.

- **Hint:** Some files have redundant lines at the beginning, some are separated by characters other than a comma, some have extra columns, some don't have the right column names, etc. You should take these into account and read in the data such that all the dataframes look alike.

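- **Hint:** If you want, you can find the rows in which two dataframes differ using something like `penguins_csv[~penguins_csv.eq(penguins_tsv).all(axis=1)]`. Note that `NaN` never compares equal to `NaN` with `.eq()`, so rows containing missing values are also flagged.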

- Tests are provided to make sure you've correctly read each file in. The tests use `df1.equals(df2)` to check for differences between a dataframe `df1` and another one `df2`.
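
The kinds of `pd.read_csv()` arguments that deal with such differences can be sketched with in-memory text instead of real files. Everything in the example (the two tiny datasets, the column names, and the leading comment line) is made up for illustration and does not reflect the contents of the lab's files.

```python
import io

import pandas as pd

# A clean, comma-separated reference "file"
clean = io.StringIO("species,mass\nAdelie,3750\nGentoo,5000\n")

# A messy variant: leading junk line, wrong header, ";" separator, extra column
messy = io.StringIO(
    "exported by a hypothetical tool\n"
    "a;b;extra\n"
    "Adelie;3750;x\n"
    "Gentoo;5000;y\n"
)

df_clean = pd.read_csv(clean)
df_messy = pd.read_csv(
    messy,
    sep=";",                              # non-comma separator
    skiprows=2,                           # skip the junk line and the wrong header
    header=None,                          # the remaining lines are all data
    names=["species", "mass", "extra"],   # supply the column names ourselves
    usecols=["species", "mass"],          # leave out the extra column
)

print(df_clean.equals(df_messy))
```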

**Note:** If Python complains about the package `openpyxl` not being available on your system, you can run the following command in a terminal to install it:

```shell
conda install -y openpyxl
```

In [None]:
# read in file named penguins.csv
penguins_csv = ...
# read in file named penguins.tsv
penguins_tsv = ...
# read in file named penguins-meta-data.csv
penguins_meta_data = ...
# read in file named penguins2.csv
penguins2 = ...
# read in file named penguins.xlsx
penguins_excel = ...
...
...

In [None]:
grader.check("q4")

## Exercise 5: Treasure hunt with Pandas

Learning how to index with Pandas is kind of like having to eat your vegetables: it's not the funnest thing, but you need to know it! In this exercise, I'll provide you with clues that you need to use to index a dataframe and find the "Treasure". For example, if the clue is "The first ten rows" then you might try:

```python
>>> df.iloc[:10]
```

The above is just an example, but the result of your indexing should always be the string `'TREASURE'`.
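
As a refresher on label-based versus position-based indexing, here is a sketch on a tiny made-up dataframe. The labels `R1`..`R3` and `C1`..`C2` are assumptions chosen for illustration and may not match the labels in the lab's file.

```python
import pandas as pd

toy = pd.DataFrame(
    {"C1": ["a", "b", "c"], "C2": ["d", "e", "f"]},
    index=["R1", "R2", "R3"],
)

by_label = toy.loc["R2", "C1"]               # label-based: row "R2", column "C1"
by_position = toy.iloc[1, 0]                 # position-based: second row, first column
mixed = toy.loc["R3"].iloc[1]                # row by label, then column by position
joined = "".join(toy.loc["R1":"R3", "C2"])   # label slices include both endpoints

print(by_label, by_position, mixed, joined)
```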

### 5.1
rubric={autograde:1}

Load in the file `treasure_hunt.csv` as a dataframe called `treasure_df`. Read in the first column as the dataframe index column and the first row as the dataframe columns.

In [None]:
treasure_df = ...

In [None]:
grader.check("q5_1")

### 5.2
rubric={autograde:1}

**Clue:** C20, R1

In [None]:
ex5_2 = ...
ex5_2

In [None]:
grader.check("q5_2")

### 5.3
rubric={autograde:1}

**Clue:** thirteenth column, thirteenth row (**hint:** recall that indexing starts at 0!)

In [None]:
ex5_3 = ...
ex5_3

In [None]:
grader.check("q5_3")

### 5.4
rubric={autograde:1}

**Clue:** R28, tenth column

In [None]:
ex5_4 = ...
ex5_4

In [None]:
grader.check("q5_4")

### 5.5
rubric={autograde:1}

**Clue:** C26, R19 to R26

**Hint:** You might need to do a bit of extra wrangling (e.g. joining stuff, maybe?) to get a result of "TREASURE" here...

In [None]:
ex5_5 = ...
ex5_5
...

In [None]:
grader.check("q5_5")

### 5.6
rubric={autograde:1}

**Clue:** The first rows of C29 *plus* the first row of C30

In [None]:
ex5_6 = ...
ex5_6

In [None]:
grader.check("q5_6")

### 5.7
rubric={autograde:1}

**Clue:** The first row of all columns that **do not** have a "Z" in the last row.

**Hint:** You might need to do a bit of extra wrangling to get a result of "TREASURE" here...

In [None]:
ex5_7 = ...
ex5_7
...

In [None]:
grader.check("q5_7")

### 5.8
rubric={autograde:1}

**Clue:** C1 for C1 < C2 (remember, in Python, comparison operators work on strings too, e.g., `"a" < "b"` evaluates to `True` because `a` occurs before `b` alphabetically. Also, don't forget about `.query()`!)

**Hint:** You might need to do a bit of extra wrangling to get a result of "TREASURE" here...
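
Comparing two string columns inside `.query()` can be sketched as follows. The dataframe `toy` and its column names are made up, and passing `engine="python"` is a conservative choice made for this sketch, not a requirement of the lab.

```python
import pandas as pd

toy = pd.DataFrame({"left": ["a", "d", "b"], "right": ["b", "c", "c"]})

print("a" < "b")   # strings compare alphabetically

# Keep only the rows where "left" sorts before "right"
subset = toy.query("left < right", engine="python")
result = "".join(subset["left"])
print(result)
```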

In [None]:
ex5_8 = ...
ex5_8
...

In [None]:
grader.check("q5_8")

## Exercise 6: Mind the gap

In this exercise, we'll practice some basic Pandas operations on the [Gapminder dataset](https://www.gapminder.org/about/) which you are already familiar with from your DSCI 523 labs. Gapminder is an educational foundation that aims to use data to unbiasedly describe trends in health and socioeconomics; it is a great source of geographical, socioeconomic, and health data - a subset of which we'll be exploring here. In particular, we'll be exploring a dataframe of the following features in this exercise:

|     | country     | continent | year | lifeExp | pop      | gdpPercap  |
|:---:|:-----------:|:---------:|:----:|:-------:|:--------:|:----------:|
|  0  | Afghanistan | Asia      | 1952 | 28.801  | 8425333  | 779.445314 |
|  1  | Afghanistan | Asia      | 1957 | 30.332  | 9240934  | 820.853030 |
|  2  | Afghanistan | Asia      | 1962 | 31.997  | 10267083 | 853.100710 |
|  3  | Afghanistan | Asia      | 1967 | 34.020  | 11537966 | 836.197138 |
|  4  | Afghanistan | Asia      | 1972 | 36.088  | 13079460 | 739.981106 |
| ... |     ...     |    ...    | ...  |   ...   |   ...    |    ...     |

### 6.1
rubric={accuracy:1}

Read the gapminder dataset into a dataframe called `df` from this url: <https://raw.githubusercontent.com/jstaf/gapminder/master/gapminder/gapminder.csv>

<!-- BEGIN QUESTION -->



In [None]:
df = ...
df

<!-- END QUESTION -->

### 6.2
rubric={accuracy:1}

Which continent has the most observations in the gapminder dataset? You can leave your answer as the output of a dataframe operation (**hint:** `.value_counts()`).
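
For reference, `.value_counts()` behaves as sketched below on a small made-up column; the names `toy` and `group` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"group": ["x", "y", "x", "z", "x", "y"]})

counts = toy["group"].value_counts()   # sorted from most to least frequent
most_common = counts.idxmax()          # label with the largest count
print(counts)
```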

<!-- BEGIN QUESTION -->



In [None]:
...

<!-- END QUESTION -->

### 6.3
rubric={accuracy:2}

What are the minimum and maximum life expectancies in the dataset, and what are the corresponding countries and the years? (**hint:** `.argmin()`/`.argmax()`)
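
Note that `.argmin()`/`.argmax()` return integer positions, while the related `.idxmin()`/`.idxmax()` return index labels. The sketch below shows the difference on a made-up dataframe whose index deliberately does not start at 0.

```python
import pandas as pd

toy = pd.DataFrame(
    {"name": ["a", "b", "c"], "score": [7.5, 2.0, 9.1]},
    index=[10, 20, 30],
)

pos = toy["score"].argmin()     # integer position, for use with .iloc
label = toy["score"].idxmin()   # index label, for use with .loc

print(toy.iloc[pos])
print(toy.loc[label])
```

When the index is the default `0, 1, 2, ...`, positions and labels coincide.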

<!-- BEGIN QUESTION -->



In [None]:
...

<!-- END QUESTION -->

### 6.4
rubric={accuracy:2}

How much larger is the total population in this dataset in 2007 compared to 1952? You can give your answer as a float, e.g., "the population is 1.8 times larger in 2007 than in 1952." (**hint:** you can use `.query()` to subset the dataframe for 1952 and then calculate the `.sum()` of the population, then repeat for 2007).
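
The subset-then-sum pattern from the hint can be sketched on made-up numbers; the dataframe `toy` and its `units` column are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {"year": [2000, 2000, 2010, 2010], "units": [10, 30, 25, 55]}
)

early = toy.query("year == 2000")["units"].sum()   # 10 + 30
late = toy.query("year == 2010")["units"].sum()    # 25 + 55
ratio = late / early
print(f"{ratio:.1f} times larger")
```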

<!-- BEGIN QUESTION -->



In [None]:
...

<!-- END QUESTION -->

### 6.5
rubric={accuracy:2}

What is the mean life expectancy of countries with the highest 50% of `gdpPercap` and countries with the lowest 50% of `gdpPercap`? (**hint:** try combining `.query()` and `.median()`)
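
Splitting rows at the median of one column and summarizing another can be sketched as follows. The dataframe `toy` and its columns are made up, and putting values equal to the median into the lower half is an arbitrary choice for this sketch.

```python
import pandas as pd

toy = pd.DataFrame({"wealth": [1, 2, 3, 4], "health": [50, 60, 70, 80]})

cutoff = toy["wealth"].median()   # 2.5 for this toy data

# "@cutoff" lets .query() refer to a Python variable
upper = toy.query("wealth > @cutoff")["health"].mean()
lower = toy.query("wealth <= @cutoff")["health"].mean()
print(upper, lower)
```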

<!-- BEGIN QUESTION -->



In [None]:
high = ...
low = ...
print(f"Life expectancy of wealthy countries: {high:.0f} years")
print(f" Life expectancy of poorer countries: {low:.0f} years")

<!-- END QUESTION -->

