# What is a Database?

  > a structured set of data held in a computer, especially one that is accessible in various ways: _a database covering nine million workers_. [as modifier]: _database systems_.
  >
  > _Oxford Dictionary of English_

  > a usually large collection of data organized especially for rapid search and retrieval (as by a computer)
  >
  > https://www.merriam-webster.com/dictionary/database

# What is a Database Management System?

  > A database-management system (DBMS) is a computer-software application that interacts with end-users, other applications, and the database itself to capture and analyze data. A general-purpose DBMS allows the definition, creation, querying, update, and administration of databases.
  > 
  > https://en.wikipedia.org/wiki/Database

# Your turn!

![](http://www.twenty19.com/blog/wp-content/uploads/2017/07/typing2.gif)

Build what is probably the simplest DB in the world!

# What is this?

```bash
#!/usr/bin/env bash
db_set () {
    echo "$1,$2" >> database
}
db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
```

  * Type in the script above using the editor of your choice.
  * **Note:** keep the spaces exactly as they are; some of them are significant.
  * Save the script above in a file called `simple_db.sh`. 
  * Then source this file via `source simple_db.sh`.
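
For readers more comfortable with Python than with shell, the same idea can be sketched roughly as below. This is only an illustrative translation, not part of the original script: the temporary file location `DB_PATH` and the `path` parameter are assumptions added so that the example is self-contained.

```python
import os
import tempfile

# Hypothetical file location; the shell script writes to a file called
# `database` in the current directory instead.
DB_PATH = os.path.join(tempfile.mkdtemp(), "database")


def db_set(key, value, path=DB_PATH):
    """Append one `key,value` line to the end of the file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{key},{value}\n")


def db_get(key, path=DB_PATH):
    """Scan the whole file and return the last value written for `key`."""
    prefix = f"{key},"
    result = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith(prefix):
                # Later lines win, like `tail -n 1` in the shell version.
                result = line[len(prefix):].rstrip("\n")
    return result


db_set(42, '{"name":"San Francisco"}')
db_set(42, '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}')
print(db_get(42))
```

As in the shell version, writing never modifies existing lines, and reading returns the most recent line for the key.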

## Using the `simple_db`


Run the following commands one after another and see what happens.

```bash
$ db_set 123456 '{"name":"London","attractions":["Big Ben","London Eye"]}' 
$ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'
$ db_get 42
```

## Exercise 1

  * Talk to your neighbour and:
    - Explain to each other what the script (the DBMS `simple_db.sh`) does when _writing_ a record.
    - Explain to each other what the script (the DBMS `simple_db.sh`) does when _reading_ a record.
  * What are the really good features of our `simple_db.sh` DB?
  * Additionally, create a list of three potential issues with our simple database.

### Performance Issue

  * Finding the value for a key takes $O(n)$ time, i.e., time linear in the number of records.
  * What does that mean?

```python
def inspect_each_element_in_a_list(list_len):
    # 'a' never occurs in the range, so every element is inspected (worst case).
    for el in range(list_len):
        if el == 'a':
            print(f'Found: {el}')


%timeit inspect_each_element_in_a_list(1000000)
%timeit inspect_each_element_in_a_list(1000000 * 2)
%timeit inspect_each_element_in_a_list(1000000 * 10)
```

```
37.5 ms ± 802 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
74.9 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
377 ms ± 8.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```


### How can we speed up the lookup?

  * Can we speed up the lookup operation to constant time, i.e., $O(1)$?
  * Which data structure in your favorite programming language supports constant-time lookup?

```python
import random
import string


def generate_random_data():
    chars = string.ascii_letters + string.digits
    return ''.join(random.sample(chars, 10))


list_len = 10
simple_index = {el: generate_random_data() for el in range(list_len)}
simple_index
```

```
{0: '4L5PFMAcbZ',
 1: '5pcPV26ufe',
 2: 'H0cNmdQoF3',
 3: 'eoNwvz2dIG',
 4: 'LjJgaYcl1E',
 5: '7F8ZebkxKX',
 6: '4nGVc1dl5J',
 7: 'MXzHa7jfd0',
 8: 'WfB3t8hdNp',
 9: 'j8pk3biGgr'}
```

```python
list_len = 1000000
simple_index = {el: generate_random_data() for el in range(list_len)}
```

```python
%timeit simple_index[10000]
%timeit simple_index[10000 * 2]
%timeit simple_index[10000 * 50]
```

```
48.5 ns ± 1.41 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
48.4 ns ± 0.567 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
48.5 ns ± 0.778 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
```


### We just built an _index_!

  * What is the drawback of using a hash map as the data structure for a database index?
  * Talk to your neighbour and name two potential drawbacks.

# Why does this matter?

  * One of the oldest DBMSs, the Unix library `dbm`, works this way; see below.
  * Modern **key-value stores** such as _Riak_ (http://basho.com/products/riak-kv/) with the storage engine _Bitcask_ (https://docs.basho.com/riak/kv/2.0.7/setup/planning/backend/bitcask/) work this way as well.
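
To make the connection between an append-only file and a hash-map index more concrete, here is a minimal sketch in Python. It is not how `dbm` or Bitcask are actually implemented; it only illustrates the general idea of keeping, for every key, the byte offset of its latest record in memory. The class name `IndexedLog` and the temporary file location are assumptions made for this example.

```python
import os
import tempfile


class IndexedLog:
    """Append-only log file plus an in-memory dict of key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}
        open(self.path, "ab").close()  # create the file if it is missing

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()  # position where this record will start
            f.write(record)
        # Remember only the newest record for each key.
        self.index[str(key)] = offset

    def get(self, key):
        offset = self.index.get(str(key))
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)  # jump straight to the record, no scanning
            line = f.readline().decode("utf-8").rstrip("\n")
        return line.split(",", 1)[1]


# Hypothetical file location, used only for this example.
db = IndexedLog(os.path.join(tempfile.mkdtemp(), "database"))
db.set(123456, '{"name":"London"}')
db.set(42, '{"name":"San Francisco"}')
db.set(42, '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}')
print(db.get(42))
```

A read now costs one dictionary lookup and one seek, regardless of how many records the file contains.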


## A closer look at `gdbm`, the _GNU dbm_

Let's say you have a program in some language (here Ruby) and you want to persist some _key-value_ data.

```ruby
require 'gdbm'

gdbm = GDBM.new("fruitstore.db")
gdbm["ananas"]    = "3"
gdbm["banana"]    = "8"
gdbm["cranberry"] = "4909"
gdbm["ananas"]    = "42"
gdbm.close
```

The example is adapted from:
https://ruby-doc.org/stdlib-2.5.0/libdoc/gdbm/rdoc/GDBM.html

```ruby
require 'gdbm'

gdbm = GDBM.new("fruitstore.db")
gdbm.each_pair do |key, value|
  print "#{key}: #{value}\n"
end
gdbm.close
```

If you do not have Ruby installed on your system, you can run the two programs as follows, assuming you saved the snippets above as `gdbm_write.rb` and `gdbm_read.rb`:

```bash
docker run -it --rm -v "$(pwd)":/src -w /src helgecph/pythonruby sh -c "ruby gdbm_write.rb;ruby gdbm_read.rb"
```

What do all the Docker switches mean and do?

### Exercise 2

  * Inspect the file `fruitstore.db`
    - What is the output of `file fruitstore.db`?
    - What can you see when you look inside?
    - How does that compare to our earlier file from 
   
If you do not have a tool like `xxd` or `hexdump` installed on your machine, you can view the contents of the file with, for example:
```bash
docker run -it --rm -v "$(pwd)":/src -w /src helgecph/pythonruby sh -c "hexdump -C fruitstore.db"
```

Python's standard library ships a deliberately simple ("dumb") pure-Python DBM implementation, `dbm.dumb`:
https://github.com/python/cpython/blob/3.6/Lib/dbm/dumb.py
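
As a rough Python counterpart to the Ruby fruit store, the following sketch uses the standard-library module `dbm.dumb`. The base name `fruitstore` inside a temporary directory is an assumption for this example. Note that `dbm.dumb` creates several files from that base name, not a single GDBM file, so the on-disk format differs from `fruitstore.db`.

```python
import dbm.dumb
import os
import tempfile

# Hypothetical location; dbm.dumb derives its file names from this base name.
base = os.path.join(tempfile.mkdtemp(), "fruitstore")

# "c" opens the database for reading and writing, creating it if needed.
with dbm.dumb.open(base, "c") as db:
    db["ananas"] = "3"
    db["banana"] = "8"
    db["cranberry"] = "4909"
    db["ananas"] = "42"  # replaces the earlier value for this key

# Reopen read-only; keys and values come back as bytes.
with dbm.dumb.open(base, "r") as db:
    fruits = {key.decode(): db[key].decode() for key in db.keys()}

print(fruits)
```

Listing the temporary directory afterwards shows the files that `dbm.dumb` wrote, which can be inspected in the same way as `fruitstore.db`.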

# Your turn at home!

![](http://www.twenty19.com/blog/wp-content/uploads/2017/07/typing2.gif)

  * Build a `simple_db` in the programming language of your choice.
    - Implement a hash-map-based index.
    - Implement functionality to store your data on disk in a binary file.
    - Implement functionality to read your data from disk again, so that you can restore your database after a shutdown.
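
One possible starting point for the binary file, sketched in Python: each record is written as two length fields followed by the key and value bytes, so it can be read back without relying on separators. The record layout, the function names, and the use of an in-memory `io.BytesIO` buffer in place of a real file are all assumptions for illustration; many other layouts would work just as well.

```python
import io
import struct

# Assumed record layout: two unsigned 32-bit big-endian lengths, followed by
# the UTF-8 key bytes and then the UTF-8 value bytes.
HEADER = struct.Struct(">II")


def write_record(f, key, value):
    key_bytes = key.encode("utf-8")
    value_bytes = value.encode("utf-8")
    f.write(HEADER.pack(len(key_bytes), len(value_bytes)))
    f.write(key_bytes)
    f.write(value_bytes)


def read_records(f):
    """Yield (offset, key, value) for every record until end of file."""
    while True:
        offset = f.tell()
        header = f.read(HEADER.size)
        if len(header) < HEADER.size:
            return  # end of file reached
        key_len, value_len = HEADER.unpack(header)
        key = f.read(key_len).decode("utf-8")
        value = f.read(value_len).decode("utf-8")
        yield offset, key, value


buffer = io.BytesIO()  # stands in for a file opened in binary mode
write_record(buffer, "ananas", "3")
write_record(buffer, "ananas", "42")

# Replaying the records from the start rebuilds a key -> offset index.
buffer.seek(0)
records = list(read_records(buffer))
index = {key: offset for offset, key, _ in records}
print(records, index)
```

Replaying the file on startup is what allows the in-memory index to be rebuilt after a shutdown; later records for the same key simply overwrite earlier index entries.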