# Our first Jupyter notebook

This is a _Jupyter notebook_. It contains _cells_ in which
we can evaluate program code.

There is built-in support for _Julia_, _Python_, and _R_
(hence _Ju-Pyt-R_). Here's some Python code:

In [None]:
def hello(name):
    print('hello ' + name + '!')

hello('world')

We're primarily going to run SQL code (see below) in our
notebooks, but I'll also show you some Python code later on
in the course.

You don't have to learn Python to take this course, since
there will always be the option to use Java instead. Still, I
encourage you to have a look at Python: it is growing
in popularity very quickly, and has become the 'lingua
franca' of data science (together with R).


## Introduction to relational databases

A [_Relational
Database_](https://en.wikipedia.org/wiki/Relational_database)
stores its data in
[_tables_](https://en.wikipedia.org/wiki/Table_(database)),
where each table looks like a simple spreadsheet:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;border-color:#999;margin:0px auto;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#999;color:#444;background-color:#F7FDFA;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#999;color:#fff;background-color:#26ADE4;}
.tg .tg-e3zv{font-weight:bold}
.tg .tg-9hbo{font-weight:bold;vertical-align:top}
.tg .tg-yw4l{vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-e3zv">year</th>
    <th class="tg-9hbo">category</th>
    <th class="tg-9hbo">laureate</th>
    <th class="tg-9hbo">motivation</th>
  </tr>
  <tr>
    <td class="tg-yw4l">2011</td>
    <td class="tg-yw4l">literature</td>
    <td class="tg-yw4l">Tomas Tranströmer</td>
    <td class="tg-yw4l">...</td>
  </tr>
  <tr>
    <td class="tg-yw4l">2011</td>
    <td class="tg-yw4l">physics</td>
    <td class="tg-yw4l">Adam Riess</td>
    <td class="tg-yw4l">...</td>
  </tr>
  <tr>
    <td class="tg-yw4l">2011</td>
    <td class="tg-yw4l">chemistry</td>
    <td class="tg-yw4l">Dan Shechtman</td>
    <td class="tg-yw4l">...</td>
  </tr>
  <tr>
    <td class="tg-yw4l">2011</td>
    <td class="tg-yw4l">medicine</td>
    <td class="tg-yw4l">Ralph Steinman</td>
    <td class="tg-yw4l">...</td>
  </tr>
</table>

A _row_ represents an item, and a _column_ represents a
property of the items.

In the example above, each row describes how someone was
awarded the Nobel Prize, and for each row, we have columns
showing what year the prize was awarded, in what category,
the name of the laureate, and the motivation (not shown
here).

The basic idea of relational databases is that all 'cells'
in the table should be simple values (no lists or objects!),
and that we can use simple operations from [_relational
algebra_](https://en.wikipedia.org/wiki/Relational_algebra)
to get information from it. We do this using a programming
language that is highly specialized for extracting
information. It is called
[SQL](https://en.wikipedia.org/wiki/SQL), which is short
for _Structured Query Language_. SQL can be pronounced
as either "S-Q-L" or "sequel".

SQL is divided into several sub-languages:

 + _DDL_ (_Data Definition Language_): constructs used to
   define the tables of a database,

 + _DML_ (_Data Manipulation Language_): statements used to
   query and manipulate data in a database,
   
 + _TCL_ (_Transaction Control Language_): commands used to
   handle transactions (we will return to what a transaction
   is later in the course), and
   
 + _DCL_ (_Data Control Language_): commands used to
   control access to our data (we will not deal with
   them in this course).

This week we'll focus on DML, i.e., ways to query our
databases -- next week we'll look at how to design and
define our databases.

Today, we'll discuss the following operations:

 + _selection_: choosing some of the rows of a table

 + _projection_: choosing some of the columns of a table

 + _union_ and _intersection_: combining the rows of two
   tables (the tables must be compatible, which means that
   they have the same columns)
   
We'll also see ways to combine queries.
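
The four operations above can be tried out without the course database. Here is a minimal sketch using Python's built-in `sqlite3` module and an in-memory table that holds only the four rows from the example table earlier. The variable names are my own, and the real `lect01.sqlite` file of course contains many more rows.

```python
import sqlite3

# A tiny stand-in for the course database: only the four example rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nobel (year INT, category TEXT, laureate TEXT, motivation TEXT)"
)
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?, ?)",
    [
        (2011, "literature", "Tomas Tranströmer", "..."),
        (2011, "physics", "Adam Riess", "..."),
        (2011, "chemistry", "Dan Shechtman", "..."),
        (2011, "medicine", "Ralph Steinman", "..."),
    ],
)

# Selection: keep some of the rows.
selection = conn.execute(
    "SELECT * FROM nobel WHERE category = 'physics'"
).fetchall()

# Projection: keep some of the columns.
projection = conn.execute("SELECT category, laureate FROM nobel").fetchall()

# Union and intersection: both queries return the same single column,
# so their results are compatible and can be combined.
sciences = "SELECT laureate FROM nobel WHERE category IN ('physics', 'chemistry')"
not_physics = "SELECT laureate FROM nobel WHERE category <> 'physics'"
union = conn.execute(f"{sciences} UNION {not_physics}").fetchall()
intersection = conn.execute(f"{sciences} INTERSECT {not_physics}").fetchall()

print(selection)
print(projection)
print(sorted(union))
print(intersection)
```

Note that `UNION` removes duplicate rows, so Dan Shechtman shows up only once in the union even though both queries return him.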

Next time we'll study some other very important operations,
which allow us to join several tables in interesting ways.

## SQLite

There are many different Relational Database Management
Systems
([RDBMSs](https://en.wikipedia.org/wiki/Relational_database))
that implement SQL. In this course we'll use
[SQLite](https://en.wikipedia.org/wiki/SQLite), which is a
lightweight but still very powerful system.

The `SQLite` file `lect01.sqlite` contains all Nobel
laureates in Physics, Chemistry, Medicine, and Literature
since 1901 (the Economics prize isn't really a Nobel Prize,
it's an award given by Riksbanken in memory of Alfred Nobel,
and the Peace Prize is awarded by Norwegians). At the bottom
of this page there's a description of how I created the
database.

To be able to write SQL queries in this notebook, we first
have to run:

In [2]:
%load_ext sql



And to work with our database, we connect to it with:

In [3]:
%sql sqlite:///lect01.sqlite



Now we're good to go. We just have to prefix our SQL queries
with `%sql` (one line of SQL) or `%%sql` (several lines of
SQL, this is the form we will use in most cases).

## Some queries

A simple _SQL query_ can be written as:

```text
SELECT <what we're looking for>
FROM   <what table we're looking in>
```


This selects all rows of a given table. If we're only
interested in some of the rows, and we normally are, we
write:

```text
SELECT <what we're looking for>
FROM   <what table we're looking in>
WHERE  <what items we're interested in>
```


The latter form is so common that it's got its own acronym:
"SFW" (short for `SELECT`-`FROM`-`WHERE`).

Let's use the first form above to see all Nobel prizes that
have been handed out:

In [None]:
%%sql


This is too much to look through, so let's first limit the
output to 10 rows:

In [None]:
%%sql


We can also _select_ only those prizes awarded in 2013.

In [None]:
%%sql


Observe that the query returns a new table. We'll soon see
that we can use the returned table in other queries.

**Q:** _What year did Einstein get his award, and why?_

This requires both a _selection_ (the row with Einstein's
award) and a _projection_ (only the year and motivation):

In [None]:
%%sql


Observe that the selection (what rows we're interested in)
is given in the `WHERE` clause, whereas the projection (what
columns we're interested in) is defined in the `SELECT`
clause (the naming is somewhat counterintuitive).

The names of the columns in the returned table are shown
above the actual output. If we want to rename any of the
columns in the returned table, we can use an _alias_:
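
Here is a small sketch of what an alias does, run through Python's `sqlite3` module rather than the `%%sql` magic. The table holds a single row from the example table, and the alias names `awarded` and `winner` are just names I made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nobel (year INT, category TEXT, laureate TEXT, motivation TEXT)"
)
conn.execute("INSERT INTO nobel VALUES (2011, 'physics', 'Adam Riess', '...')")

# AS gives a column of the returned table a new name.
cur = conn.execute("SELECT year AS awarded, laureate AS winner FROM nobel")

# cursor.description holds one entry per output column; the name comes first.
column_names = [column[0] for column in cur.description]
rows = cur.fetchall()

print(column_names)
print(rows)
```

The alias only renames the column in the result, the `nobel` table itself is unchanged.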

In [None]:
%%sql


**Q:** _Who was awarded the physics prize in 1922?_

In [None]:
%%sql


**Q:** _Who were awarded the physics prize in 1922 and
1923?_ (Solve this problem in at least four different ways).

In [None]:
%%sql


In [None]:
%%sql


In [None]:
%%sql


In [None]:
%%sql


There are often several ways of doing things in SQL, and one
of the main points of using SQL is that the RDBMS tries to
optimize the operations it needs to fetch our data (there is
some seriously clever code running behind the scenes).
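
To illustrate that several formulations can give the same answer, here is a sketch that asks a sibling question (the chemistry prize in 1922 and 1923) in four ways, using Python's `sqlite3` module. The rows are a handful I typed in by hand, so the spelling of the names may differ from the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nobel (year INT, category TEXT, laureate TEXT, motivation TEXT)"
)
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?, ?)",
    [
        (1921, "chemistry", "Frederick Soddy", "..."),
        (1922, "chemistry", "Francis Aston", "..."),
        (1923, "chemistry", "Fritz Pregl", "..."),
        (1922, "physics", "Niels Bohr", "..."),
    ],
)

base = "SELECT year, laureate FROM nobel WHERE category = 'chemistry'"
queries = {
    "OR": f"{base} AND (year = 1922 OR year = 1923)",
    "IN": f"{base} AND year IN (1922, 1923)",
    "BETWEEN": f"{base} AND year BETWEEN 1922 AND 1923",
    "UNION": f"{base} AND year = 1922 UNION {base} AND year = 1923",
}

# Sort in Python so that row order does not matter in the comparison.
results = {name: sorted(conn.execute(sql).fetchall()) for name, sql in queries.items()}

for name, rows in results.items():
    print(name, rows)
```

All four queries return the same two rows. Which one the database executes fastest is up to the query optimizer.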

**Q:** _Who has been awarded the prize in medicine since
2010, ordered by name?_

In [None]:
%%sql


**Q:** _What year did Winston Churchill win a prize, and in
what category?_

In [None]:
%%sql


Using `LIKE` in our conditions, we get a rudimentary form
of wildcard matching (some SQL databases allow more advanced
regular expressions, but that's beyond the scope of this
course).
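
As a sketch of how the wildcards behave: in a `LIKE` pattern, `%` matches any sequence of characters (including none) and `_` matches exactly one character. The example below uses Python's `sqlite3` module and a few hand-typed rows, so the names may be spelled differently in the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nobel (year INT, category TEXT, laureate TEXT, motivation TEXT)"
)
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?, ?)",
    [
        (1903, "physics", "Henri Becquerel", "..."),
        (1903, "physics", "Pierre Curie", "..."),
        (1903, "physics", "Marie Curie", "..."),
        (1911, "chemistry", "Marie Curie", "..."),
    ],
)

# '%Curie' matches every name that ends with 'Curie'.
ends_with = conn.execute(
    "SELECT year, category, laureate FROM nobel "
    "WHERE laureate LIKE '%Curie' ORDER BY year, laureate"
).fetchall()

# In SQLite, LIKE ignores case for ASCII letters by default.
lower_case = conn.execute(
    "SELECT count(*) FROM nobel WHERE laureate LIKE '%curie'"
).fetchone()[0]

print(ends_with)
print(lower_case)
```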


If we want to categorize our output, we can use a `CASE`
expression, which has the general form:

```sql
SELECT ..., 
       CASE 
           WHEN ... THEN ...
           WHEN ... THEN ...
           ELSE ...
       END AS <name>
FROM ...
```
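
Here is a sketch of the general form filled in, run with Python's `sqlite3` module. The three rows, the period labels, and the alias `period` are my own choices for the illustration. The conditions are tried from top to bottom, and the first one that holds decides the value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nobel (year INT, category TEXT, laureate TEXT, motivation TEXT)"
)
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?, ?)",
    [
        (1933, "physics", "Paul Dirac", "..."),
        (1978, "physics", "Pyotr Kapitsa", "..."),
        (2013, "physics", "Peter Higgs", "..."),
    ],
)

periods = conn.execute(
    """
    SELECT laureate,
           CASE
               WHEN year < 1950 THEN 'before 1950'
               WHEN year <= 2000 THEN '1950-2000'
               ELSE 'after 2000'
           END AS period
    FROM   nobel
    ORDER BY year
    """
).fetchall()

print(periods)
```

Because the first matching `WHEN` wins, the second condition does not need to repeat `year >= 1950`.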


**Q:** _Show all laureates in physics with a name beginning
with 'P'. If they won the prize before 1970 they're ancient,
if they won the prize between 1970 and 2000 they're veterans,
otherwise they're newbies._

In [None]:
%%sql


### `SELECT` and `SELECT DISTINCT`

**Q:** _What are the different categories of Nobel prizes?_

In [None]:
%%sql


Using `SELECT DISTINCT` we only get unique rows in our
output table.
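
A quick sketch of the difference, using Python's `sqlite3` module and a few hand-typed rows for 2011 (the real database has more rows and may spell the names differently):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nobel (year INT, category TEXT, laureate TEXT, motivation TEXT)"
)
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?, ?)",
    [
        (2011, "physics", "Saul Perlmutter", "..."),
        (2011, "physics", "Brian Schmidt", "..."),
        (2011, "physics", "Adam Riess", "..."),
    ],
)

# Plain SELECT returns one output row per matching input row.
all_years = conn.execute("SELECT year FROM nobel").fetchall()

# SELECT DISTINCT collapses identical output rows into one.
distinct_years = conn.execute("SELECT DISTINCT year FROM nobel").fetchall()

print(all_years)
print(distinct_years)
```

The comparison is made on whole output rows, so `SELECT DISTINCT year, laureate` would still return three rows here.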


### Using functions and aggregate functions

There are some functions we can apply to our values, and each
RDBMS supplies its own set of functions. For example,
SQLite has a `substr` function:

```sql
substr(value, first_pos, [length])
```
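
As a small sketch of how `substr` counts, here it is applied to a literal string through Python's `sqlite3` module. Positions start at 1, and leaving out the length returns everything up to the end of the string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# No table is needed: SELECT can evaluate expressions on literal values.
first_three, from_fifth = conn.execute(
    "SELECT substr('Dan Shechtman', 1, 3), substr('Dan Shechtman', 5)"
).fetchone()

print(first_three)
print(from_fifth)
```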


**Q:** _What were the initial letters of the laureates in
year 2000?_

In [None]:
%%sql


An _aggregate function_ can be applied to all rows in a
table, and then returns only one value.

The standard aggregate functions are:

 + `AVG`: calculates the average for a given column
 + `COUNT`: counts the rows in a given table
 + `MIN`: gets the minimum value of a given column
 + `MAX`: gets the maximum value of a given column
 + `SUM`: calculates the sum of a given column
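
Here is a sketch that applies all five functions to the `year` column of a three-row sample table, using Python's `sqlite3` module. The rows are hand-typed stand-ins, so the numbers say nothing about the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nobel (year INT, category TEXT, laureate TEXT, motivation TEXT)"
)
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?, ?)",
    [
        (1903, "physics", "Marie Curie", "..."),
        (1921, "physics", "Albert Einstein", "..."),
        (2011, "physics", "Adam Riess", "..."),
    ],
)

# Each aggregate turns the whole column into a single value,
# so the query returns exactly one row.
row_count, first_year, last_year, year_sum, year_avg = conn.execute(
    "SELECT count(*), min(year), max(year), sum(year), avg(year) FROM nobel"
).fetchone()

print(row_count, first_year, last_year, year_sum, year_avg)
```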


**Q:** _How many of the laureates have had a first name
beginning with an 'A'?_

In [None]:
%%sql


**Q:** _What year was the first Nobel prize awarded?_

In [None]:
%%sql


**Q:** _How many Nobel prizes for chemistry have been
awarded?_

In [None]:
%%sql


### Grouping and aggregates

**Q:** _How many laureates are there in each category? Which
category has seen the most laureates?_

In [None]:
%%sql


Using `GROUP BY` groups the output in a way which lets us
use aggregate functions for each of the different groups
(the result of a `GROUP BY` query without an aggregate
function often looks weird). We can group on _one column or
more_.
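
A sketch of grouping on a small scale, using Python's `sqlite3` module and five hand-typed rows from 2011 (a sample only, so the counts are not the real totals):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nobel (year INT, category TEXT, laureate TEXT, motivation TEXT)"
)
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?, ?)",
    [
        (2011, "physics", "Saul Perlmutter", "..."),
        (2011, "physics", "Brian Schmidt", "..."),
        (2011, "physics", "Adam Riess", "..."),
        (2011, "chemistry", "Dan Shechtman", "..."),
        (2011, "literature", "Tomas Tranströmer", "..."),
    ],
)

# One output row per distinct category; count(*) is computed inside each group.
per_category = conn.execute(
    "SELECT category, count(*) FROM nobel GROUP BY category ORDER BY category"
).fetchall()

# Grouping on two columns gives one row per (year, category) combination.
per_year_and_category = conn.execute(
    "SELECT year, category, count(*) FROM nobel "
    "GROUP BY year, category ORDER BY year, category"
).fetchall()

print(per_category)
print(per_year_and_category)
```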

**Q:** _How many olympic games has each continent hosted?_

In [None]:
%%sql


**Q:** _When were the first olympic games held in each continent?_

In [None]:
%%sql


If we add a `HAVING` clause, we can filter groups in the
same way we filter rows with a `WHERE` clause.
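
To make the difference concrete, here is a sketch on the Nobel table instead of the olympics table, using Python's `sqlite3` module and a few hand-typed rows. `WHERE` removes rows before the groups are formed, and `HAVING` then removes whole groups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nobel (year INT, category TEXT, laureate TEXT, motivation TEXT)"
)
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?, ?)",
    [
        (1921, "physics", "Albert Einstein", "..."),
        (2011, "physics", "Saul Perlmutter", "..."),
        (2011, "physics", "Brian Schmidt", "..."),
        (2011, "physics", "Adam Riess", "..."),
        (2011, "chemistry", "Dan Shechtman", "..."),
    ],
)

shared_2011 = conn.execute(
    """
    SELECT category, count(*)
    FROM   nobel
    WHERE  year = 2011          -- filters rows: the 1921 row is dropped
    GROUP BY category
    HAVING count(*) > 1         -- filters groups: chemistry is dropped
    """
).fetchall()

print(shared_2011)
```

A condition on an aggregate such as `count(*)` can't go in the `WHERE` clause, since the groups don't exist yet when the rows are filtered.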

**Q:** _Which countries have hosted the summer olympics more
than once?_

In [None]:
%%sql


**Q:** _List the continents in descending order by the
number of times they've hosted the summer olympics_

In [None]:
%%sql


**Q:** _Show a 'histogram' of the initial letter of
the names of all Nobel laureates_

In [None]:
%%sql


**Q:** _Show a 'histogram' of the initial letter of
the names of all Nobel laureates, for each category_

In [None]:
%%sql


### Subqueries

**Q:** _Has the Nobel prize for literature ever been split?_

In [None]:
%%sql


A useful pattern is:

```sql
SELECT ...
FROM   ...
WHERE  ... IN
       (SELECT ...
        FROM ...
        WHERE ...)
```


The second query is called a _subquery_.
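
Here is a sketch of the pattern with a different question than the ones in this notebook: who was awarded a prize in the same year as Marie Curie? It uses Python's `sqlite3` module and a few hand-typed rows, so the names may be spelled differently in the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nobel (year INT, category TEXT, laureate TEXT, motivation TEXT)"
)
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?, ?)",
    [
        (1903, "physics", "Henri Becquerel", "..."),
        (1903, "physics", "Pierre Curie", "..."),
        (1903, "physics", "Marie Curie", "..."),
        (1911, "chemistry", "Marie Curie", "..."),
        (1911, "physics", "Wilhelm Wien", "..."),
        (1921, "physics", "Albert Einstein", "..."),
    ],
)

same_year = conn.execute(
    """
    SELECT year, laureate
    FROM   nobel
    WHERE  laureate <> 'Marie Curie' AND year IN
           (SELECT year
            FROM   nobel
            WHERE  laureate = 'Marie Curie')
    ORDER BY year, laureate
    """
).fetchall()

print(same_year)
```

The subquery returns a one-column table of years (1903 and 1911 here), and the outer query keeps the rows whose `year` occurs in it.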

**Q:** _Which literature laureates split their prize?_

In [None]:
%%sql


There is another form of subquery which we'll return to
later.

**Q:** _Who has won the literature prize in a year when at
least one chemistry laureate had a name beginning with 'L'?_

In [None]:
%%sql
SELECT ...

In [None]:
%%sql
SELECT year
FROM   olympics
WHERE  continent = 'Europe'

**Q:** _Who has shared the chemistry prize with exactly one
other laureate in years when the summer olympics were held
in Europe?_

In [None]:
%%sql
SELECT year, laureate

**Q:** _Has anyone won more than one prize?_

In [None]:
%%sql


**Q:** _Has anyone won more than one prize in the same
category?_

In [None]:
%%sql


In [None]:
%%sql


In [None]:
%%sql


In [None]:
%%sql


Is there any redundancy in the table of the summer olympics?



## Generating the database

This is a description of how I created the database. It's
just for the curious, and not part of the course.

The data is copied from
[`Nobelprize.org`](https://www.nobelprize.org/nobel_prizes/lists/all/create_list.html),
and then pasted into a text file `nobel.csv` (`.csv` for
comma-separated values). I used `Emacs` to tidy things up
(just some simple macros), and then I imported the text file
into `sqlite` using the following script (I put it in the
text file `nobel.sql`):

```{sql}
DROP TABLE IF EXISTS nobel;
CREATE TABLE nobel (
  year        INT,
  category    TEXT,
  laureate    TEXT,
  motivation  TEXT
);

.mode csv
.separator ';'
.import nobel.csv nobel
.save nobel.sqlite
```


To create a `sqlite`-file `nobel.sqlite` with all laureates,
I only had to run the following command in a terminal (the
exclamation mark tells `jupyter` to execute a shell
command):

In [None]:
!sqlite3 nobel.sqlite < nobel.sql

You can find information about running
[SQLite](http://sqlite.org/) from a command line
[here](https://sqlite.org/cli.html). I could have made
things somewhat easier for myself by adding an extra header
row first in my `.csv`-file, but I wanted to define the
table myself, to make sure that `year` was saved as integers
-- we'll return to this later in the course.
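
For the curious, here is a sketch of why the declared column type matters, using Python's `sqlite3` module. Values read from a `.csv` file arrive as text, and as far as I know `.import` creates plain `TEXT` columns when it has to create the table itself. The table names `typed` and `untyped` are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE typed (year INT)")
conn.execute("CREATE TABLE untyped (year TEXT)")

# Insert the same text value, the way it would arrive from a text file.
conn.execute("INSERT INTO typed VALUES ('2011')")
conn.execute("INSERT INTO untyped VALUES ('2011')")

# typeof() reports how SQLite actually stored each value.
typed_storage = conn.execute("SELECT typeof(year) FROM typed").fetchone()[0]
untyped_storage = conn.execute("SELECT typeof(year) FROM untyped").fetchone()[0]

print(typed_storage)
print(untyped_storage)
```

With the `INT` declaration, SQLite converts text that looks like a whole number into an integer when it is stored, which is what makes numeric comparisons such as `year < 1970` behave as expected.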