# Jupyter Notebook introduction

We'll start off with a very brief introduction to the basics of using Jupyter notebooks.

### Notebook cells

A notebook consists of a sequence of cells. A cell is a multi-line text input field, and its contents can be executed by pressing `Shift-Enter`, or by clicking the `Run` button in the toolbar. What exactly this does depends on the type of cell. There are four types of cells: *code cells*, *markdown cells*, *raw cells* and *heading cells*. We will focus only on the first two: code and markdown. Every cell starts off as a code cell, but its type can be changed using the dropdown in the toolbar (which initially shows `Code`).

In a code cell you can write *Python* code. When you run that cell (click on it and press `Shift-Enter`), the code in the cell will run and its output will be displayed beneath the cell. Let's try out a very simple code cell below.

In [None]:
x = 5
x = x + 2
print(x)

This produces the output you might expect: exactly the same result as executing that bit of *Python* code in a terminal. You can modify the contents of the code cell and run it again with `Shift-Enter` to see how the output changes.

Global variables are shared between cells. This means we can still use variables or functions from the first cell in a second cell, like so:

In [None]:
y = 2 * x
print(y)

Notebooks are expected to be run top to bottom, starting with the first cell and ending with the last. **Failing to run some cells or running cells out of order is likely to result in errors.** For example, if we were to run the second cell before the first one has been run, we would get an error saying `x` is not defined.
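As a minimal sketch of what such a failure looks like, the snippet below uses a name before anything has been assigned to it and catches the resulting `NameError`. The names `z` and `message` are hypothetical and only serve this illustration; they are not part of the exercises.

```python
# Simulate running a cell that uses a variable before the cell defining it has run.
try:
    print(2 * z)  # `z` is a hypothetical name that was never assigned
except NameError as error:
    # Python reports which name could not be found.
    message = str(error)

print(message)
```

In a notebook you would normally not catch this error; you would simply run the cell that defines the variable first and then run the failing cell again.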

Before you hand in this exercise, make sure that it can run without errors from top to bottom. Test this by selecting *Kernel -> Restart & Run All* in the menu.

Now to the actual assignment:

# Pandas

In this series of exercises you will learn to use the library `pandas`. Pandas is a very popular library for storing and manipulating data. For the exercises in this notebook, we will be relying on the following sources:
- Chapter 3 from the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
- The official Pandas documentation: [Pandas Documentation](https://pandas.pydata.org/)

**All of these exercises can be done in one or two lines of code. If you feel like you need more lines of code, you should look at the available functions for `pandas` again. This might be difficult and time-consuming now, but will save you _a lot_ of time in the future!**

## Series 

Start by reading the introduction of `pandas` objects [here](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html).

Then, run the cell below to import the necessary libraries. In our distribution, we've also included a file named `answers.py`, which we will also import below. After each set of exercises, a function from `answers.py` is called that will check your answers and provide feedback in the form of assertions.

In [None]:
from pandas import Series
import pandas as pd
import answers

### Exercise 1

Create a `Series`-object named `income` containing the following *figures* and using the *sources* as its index.

In [None]:
from pandas import Series
import pandas as pd

income_sources = ["sales", "ads", "subscriptions", "donations"]
income_figures = [39041, 8702, 13200, 292]

income = None

# your code here

display(income)

Check your answer by running the cell below.

In [None]:
answers.test_1(income)

### Exercise 2

Create a `Series` named `expenses` that uses the expense information below. Also create a `Series` named `profits` with the profit (income minus expenses). Create a variable named `total_profit` with the sum of all profits in `profits`.

In [None]:
expense_sources = ["ads", "sales", "donations", "subscriptions"]
expense_figures = [4713, 24282, 0, 3302]

# your code here

print(total_profit)

Check your answer by running the cell below.

In [None]:
answers.test_2(expenses, profits, total_profit)

## DataFrames

### Exercise 3

Create a `DataFrame` named `skittles` with the *columns* `amount` and `rating`, using the different colors as the *index*.

|&nbsp;      | amount | rating |
|------------|--------|--------|
| **red**    | 7      | 3      |
| **green**  | 4      | 4      |
| **blue**   | 6      | 2      |
| **purple** | 5      | 4      |
| **pink**   | 6      | 3.5    |

Using `Jupyter`'s `display()` makes sure we get a nicely formatted table.

In [None]:
from pandas import DataFrame

# your code here

display(skittles)

Check your answer by running the cell below.

In [None]:
answers.test_3(skittles)

### Exercise 4

Calculate the mean `rating` and save as `skittles_average`.

In [None]:
# your code here

display(skittles_average)

Check your answer by running the cell below.

In [None]:
answers.test_4(skittles_average)

### Exercise 5

Add a new column to the skittles `DataFrame` named `score`. The score of a color is equal to `amount * rating`.

In [None]:
# your code here
display(skittles)

Check your answer by running the cell below.

In [None]:
assert "score" in skittles

## Indexing and selection

Read the [next](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html) part of the reference.

### Exercise 6

For the given `DataFrame`, select only columns 'a', 'c', and 'e', and rows 10, 20, 50, and 60, and store the result in the variable `frame2`. As a clarification, the original `DataFrame` looks like:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>a</th>      <th>b</th>      <th>c</th>      <th>d</th>      <th>e</th>      <th>f</th>      <th>g</th>    </tr>  </thead>  <tbody>    <tr>      <th>10</th>      <td>0.0</td>      <td>1.0</td>      <td>2.0</td>      <td>3.0</td>      <td>4.0</td>      <td>5.0</td>      <td>6.0</td>    </tr>    <tr>      <th>20</th>      <td>7.0</td>      <td>8.0</td>      <td>9.0</td>      <td>10.0</td>      <td>11.0</td>      <td>12.0</td>      <td>13.0</td>    </tr>    <tr>      <th>30</th>      <td>14.0</td>      <td>15.0</td>      <td>16.0</td>      <td>17.0</td>      <td>18.0</td>      <td>19.0</td>      <td>20.0</td>    </tr>    <tr>      <th>40</th>      <td>21.0</td>      <td>22.0</td>      <td>23.0</td>      <td>24.0</td>      <td>25.0</td>      <td>26.0</td>      <td>27.0</td>    </tr>    <tr>      <th>50</th>      <td>28.0</td>      <td>29.0</td>      <td>30.0</td>      <td>31.0</td>      <td>32.0</td>      <td>33.0</td>      <td>34.0</td>    </tr>    <tr>      <th>60</th>      <td>35.0</td>      <td>36.0</td>      <td>37.0</td>      <td>38.0</td>      <td>39.0</td>      <td>40.0</td>      <td>41.0</td>    </tr>  </tbody></table>

Select the appropriate columns and rows, such that the result looks like:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>a</th>      <th>c</th>      <th>e</th>    </tr>  </thead>  <tbody>    <tr>      <th>10</th>      <td>0.0</td>      <td>2.0</td>      <td>4.0</td>    </tr>    <tr>      <th>20</th>      <td>7.0</td>      <td>9.0</td>      <td>11.0</td>    </tr>    <tr>      <th>50</th>      <td>28.0</td>      <td>30.0</td>      <td>32.0</td>    </tr>    <tr>      <th>60</th>      <td>35.0</td>      <td>37.0</td>      <td>39.0</td>    </tr>  </tbody></table>


In [None]:
import numpy as np

frame = DataFrame(np.arange(6 * 7.).reshape((6, 7)), index=[10, 20, 30, 40, 50, 60], columns=list('abcdefg'))

# your code here

display(frame2)

Check your answer by running the cell below.

In [None]:
answers.test_6(frame2)

### Exercise 7

Masking is a very useful operation on Pandas DataFrames. Masking can also be used on NumPy arrays, and as such it will prove useful for future assignments. The book explains this concept, but only very briefly. Read [the following webpage](https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html) to get a more thorough introduction to masking and use it to solve the following exercises.
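To give a rough idea of what masking means, here is a small sketch on a plain NumPy array. The array `values` and the threshold are made-up example data, not taken from the exercises.

```python
import numpy as np

values = np.array([3, 8, 15, 1, 12])  # hypothetical example data

# Comparing an array with a scalar yields a boolean array of the same shape.
mask = values > 5
print(mask)

# Indexing with the mask keeps only the elements where the mask is True.
print(values[mask])

# Assigning through the mask overwrites exactly those elements in place.
values[mask] = 0
print(values)
```

The same pattern of building a boolean mask from a comparison and then indexing or assigning with it carries over to `pandas` objects.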

First, let's get a bit of practice in. Get a truth value array for all values in `frame` lower than 11, save it in a variable named `mask`, and then print `mask`.

In [None]:
frame = DataFrame(np.arange(6 * 7.).reshape((6, 7)), index=[10, 20, 30, 40, 50, 60], columns=list('abcdefg'))

# your code here

Use the variable `mask` to change all values lower than 11 to the value 10 in the DataFrame `frame`. Check that the values have actually been overwritten by displaying `frame`.

In [None]:
# your code here

display(frame)

Now that we've done a bit of practice, we can do something slightly more difficult. Replace all values in the data frame `frame` that are *divisible by 3* with the value *0*. Store the result in `frame`. Use the `%` (modulo) operator!

In [None]:
frame = DataFrame(np.arange(6 * 7.).reshape((6, 7)), index=[10, 20, 30, 40, 50, 60], columns=list('abcdefg'))

# your code here

display(frame)

Check your answer by running the cell below.

In [None]:
answers.test_7(frame)

## Operating on dataframes

Read about [operations in pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.03-operations-in-pandas.html).

### Exercise 8

Create a `Series` named `series_c` with the result of the calculation `series_a - series_b`. Note that `series_a` and `series_b` do not use the same indexing. Replace any missing values in `series_b` with 0. Values that are in `series_b` but not in `series_a` need not be in `series_c`.

**Hint:** An easy way to get all indices and filter on them is by using `DataFrame.index` or `Series.index`. These can be used as a _mask_ to only get the indices you want!
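As an illustration of the hint, the sketch below builds a mask from two indices. The series `prices` and `stock` are hypothetical and unrelated to the exercise data, and `Index.isin` is only one possible way to compare indices.

```python
from pandas import Series

# Hypothetical data, unrelated to the exercise.
prices = Series([1.5, 2.0, 0.75], index=["apple", "banana", "cherry"])
stock = Series([10, 4], index=["banana", "durian"])

# `Index.isin` returns a boolean array marking which labels of `stock`
# also occur in the index of `prices`.
shared = stock.index.isin(prices.index)
print(shared)

# Used as a mask, it keeps only the entries of `stock` whose label is shared.
print(stock[shared])
```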

In [None]:
series_a = Series([500, 400, 300, 200, 100], index=["a", "b", "c ", "d", "e"])
series_b = Series([23, 46, 67, 79], index=["a", "c ", "f", "g"])
series_c = None

# your code here

display(series_c)

Check your answer by running the cell below.

In [None]:
answers.test_8(series_a, series_b, series_c)

## Map

`pandas` has its own `map()` function! It maps a function to every element of a `Series` or `DataFrame` and gives back a new `Series` or `DataFrame`. As an example:

```Py
tokens = Series(["hello", " ", "world!"])

lengths = []
for token in tokens:
    lengths.append(len(token))
lengths = Series(lengths)
```

This can be rewritten as:

```Py
tokens = Series(["hello", " ", "world!"])
lengths = tokens.map(len)
```
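`map()` is not limited to built-in functions such as `len`. As a small sketch, the example below passes a self-written function to `map()`; the data and the helper `to_fahrenheit` are hypothetical and only serve as an illustration.

```python
from pandas import Series

temperatures_c = Series([0.0, 25.0, 100.0])  # hypothetical example data


def to_fahrenheit(celsius):
    """Convert a single temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


# `map` calls the function once per element and collects the results in a new Series.
temperatures_f = temperatures_c.map(to_fahrenheit)
print(temperatures_f)
```

Note that `map()` returns a new `Series`; the original `temperatures_c` is left unchanged unless you assign the result back to it.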

### Exercise 9

Convert all words in the `Series` `words` to lowercase. Use `<str>.lower()` and the `map()` function.

In [None]:
words = Series(["foo", "Bar", "baz", "QUX", "QuUuX"])

# your code here

display(words)

Check your answer by running the cell below.

In [None]:
answers.test_9(words)

### Exercise 10

Sort `frame` on column `c` in descending order and store the solution in `frame`.

In [None]:
data = [[0.691074, -1.272521, -0.968045, -2.066171, -0.670358, 1.399483, -1.148168], 
        [1.75378, 2.409629, 1.842674, 0.754906, -0.115614, 0.877219, 1.599362], 
        [-1.41176, 1.103801, 1.216514, 0.548866, 2.255482, -0.176342, 0.965265], 
        [-0.741689, 0.216645, -0.278025, 0.777175, 0.869239, -0.943004, -0.140957], 
        [-1.58593, 1.1796, -0.702286, 2.367875, 0.592748, 1.386158, 0.535978], 
        [0.58498, 0.62389, -0.425614, 0.530479, -1.818631, -1.593188, 1.591233]]

frame = DataFrame(data, columns=list('abcdefg'))

# your code here

display(frame)

Check your answer by running the cell below.

In [None]:
answers.test_10(frame)

## Missing values

Read about [Handling Missing Data](https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html).

### Exercise 11

In the `Series` `speeds` below, fill in the missing values. For each missing value, use the speed from the most recent available datapoint, provided it is at most 3 datapoints back. Delete any datapoints that still do not have values afterwards.

**Hint:** Take a good look at the [`fillna()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.fillna.html) documentation to see whether there are any parameters that help here.

In [None]:
speeds = Series(
         [49, 51, None, None, 50, 48, 47, 50,
          51, 47, 46, None, 46, 48, 48, 48,
          None, 49, None, None, None, None,
          None, 50, 50, 50, 51, 52, 51, 50,
          None, 50, None, None, None, None,
          None, 50, 49, 48, 49, None, 50, 50, 49])

# your code here

display(speeds)

Check your answer by running the cell below.

In [None]:
answers.test_11(speeds)