# Feb 7th - Document-oriented DB 1

#### Literature
* K. Banker, P. Bakkum, S. Verch, D. Garrett  _"MongoDB in Action, Second Edition"_
    - Chapter 1
    - Chapter 2
    - Chapter 3
* https://github.com/datsoftlyngby/soft2019spring-databases/blob/master/literature/session_2.zip
* An important part of databases is *transactions*. In case you have not read about the **ACID** criteria, this blog post is both correct and short: https://blog.yugabyte.com/a-primer-on-acid-transactions/.

#### Handin

See end of this document

#### Study activity

  * Read 3 hrs (notice - a lot to read)
  * Exercises 5 hrs

# Follow up on last hand-in



# Data Models


  > Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we think about the problem that we are solving.
  >
  > Martin Kleppmann, _Designing Data-Intensive Applications_
  
  (_His usage of "data model" is what some call "database schema"_)

## What was the Data Model in the Last Lecture and Exercise?

# Object relational mismatch

```java
class Person {
    String name;
    int yearBorn;
    Image avatar;
    HashMap<String, String> socialMedia;
    List<String> posts;
    List<Person> friends;
}
```

## In-class exercise: Make a relational model which can store Person objects

## The Document Data Model


A document is essentially a set of property names and their values.

```json
{ name: "Þjóðbjörg", born:1998, ...}
```

The values can be simple data types, such as _strings_, _numbers_, and _dates_. 

But these values can also be arrays and even other documents.

Objects: {key<sub>0</sub>:value<sub>0</sub>, key<sub>1</sub>:value<sub>1</sub>, ...}<br>
Arrays: \[value<sub>0</sub>, value<sub>1</sub>, value<sub>2</sub>, ...\]



```javascript
{
  _id: ObjectID('4bd9e8e17cefd644108961bb'),     // _id field, primary key
  title: 'Adventures in Databases',
  url: 'http://example.com/databases.txt',
  author: 'msmith',
  vote_count: 20,
  tags: ['databases', 'mongodb', 'indexing'],    // Tags stored as array of strings
  image: {                                       // Attribute pointing to another document
    url: 'http://example.com/db.jpg',
    caption: 'A database.',
    type: 'jpg',
    size: 75381,
    data: 'Binary'
  },
  comments: [                                    // Comments stored as array of comment objects
    {
      user: 'bjones',
      text: 'Interesting article.'
    },
    {
      user: 'sverch',
      text: 'Color me skeptical!'
    }
  ]
}
```

Note that strict JSON requires double quotes around every key and every string value; numbers, booleans, and `null` are written without quotes. The listing above shows the JavaScript notation of the document, as used in the MongoDB shell. Internally, MongoDB stores documents in a format called _Binary JSON_, or **BSON**.
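
To make the quoting rule concrete, here is a minimal Python sketch (Python is the language of the notebook cells later in this lecture). It serializes a dictionary to strict JSON with the standard `json` module. The variable names `post`, `strict_json`, and `round_trip` are only illustrative, and the values are a shortened copy of the listing above.

```python
import json

# A shortened version of the document from the listing above, as a Python dict.
post = {
    "title": "Adventures in Databases",
    "vote_count": 20,
    "tags": ["databases", "mongodb", "indexing"],
    "image": {"url": "http://example.com/db.jpg", "size": 75381},
}

# json.dumps emits strict JSON: keys and strings are double-quoted, numbers are bare.
strict_json = json.dumps(post)
print(strict_json)

# Parsing the text again yields an equal dictionary.
round_trip = json.loads(strict_json)
print(round_trip == post)
```

Nested documents and arrays survive the round trip unchanged, which is what makes this notation convenient for exchanging documents between programs.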

## The Document Data Model in MongoDB

On top of documents, MongoDB has the concept of _collections_. A _collection_ is a group of _documents_.

_Collections_ are similar to tables in the relational world.

The document-oriented data model naturally represents data in an aggregate form, allowing you to work with an object holistically.

# Architecture of a Database System for Experimenting

## A Database Container

![](images/DB_containers_internal.png)

```bash
$ docker run --rm --publish=27017:27017 --name dbms -d mongo:latest
$ docker run -it --link dbms:mongo --rm mongo sh -c 'exec mongo "$MONGO_PORT_27017_TCP_ADDR:$MONGO_PORT_27017_TCP_PORT/test"'
```


**Note:** Do not do this in production. This is a setup that we will use for experimentation only!

In [158]:
%%bash
#"docker run --rm --publish=27017:27017 --name dbms -d mongo:latest
docker run -i --link dbms:mongo --rm mongo sh -c 'exec mongo "$MONGO_PORT_27017_TCP_ADDR:$MONGO_PORT_27017_TCP_PORT/test"'

MongoDB shell version v4.0.5
connecting to: mongodb://172.17.0.2:27017/test?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("5e3dd146-55cf-4f86-828b-98d75443d13b") }
MongoDB server version: 4.0.5
bye


## A Database Container

What is the issue with such a setup?

## Containerized DB Setup for Production

![](images/DB_containers_external.png)


```bash
$ docker run --rm -v $(pwd)/data:/data/db --publish=27017:27017 --name dbms -d mongo:latest
$ docker run -it --link dbms:mongo --rm mongo sh -c 'exec mongo "$MONGO_PORT_27017_TCP_ADDR:$MONGO_PORT_27017_TCP_PORT/test"'
```

## Containerized DB Setup for Production


What is the advantage of such a setup?

In [1]:
%%bash
ls -ltra ../data

total 104
-rw-r--r--@  1 rhp  staff     0 Jan 23 10:20 .gitkeep
drwxr-xr-x  13 rhp  staff   442 Feb  2 15:39 ..
-rw-------   1 rhp  staff   114 Feb  7 18:05 storage.bson
-rw-------   1 rhp  staff  4096 Feb  7 18:05 sizeStorer.wt
-rw-------   1 rhp  staff     2 Feb  7 18:05 mongod.lock
drwx------   5 rhp  staff   170 Feb  7 18:05 journal
-rw-------   1 rhp  staff  4096 Feb  7 18:05 index-3--7174197220278716654.wt
-rw-------   1 rhp  staff  4096 Feb  7 18:05 index-1--7174197220278716654.wt
-rw-------   1 rhp  staff  4096 Feb  7 18:05 collection-2--7174197220278716654.wt
-rw-------   1 rhp  staff  4096 Feb  7 18:05 collection-0--7174197220278716654.wt
-rw-------   1 rhp  staff  4096 Feb  7 18:05 _mdb_catalog.wt
-rw-------   1 rhp  staff  4096 Feb  7 18:05 WiredTigerLAS.wt
-rw-------   1 rhp  staff  4096 Feb  7 18:05 WiredTiger.wt
-rw-------   1 rhp  staff   889 Feb  7 18:05 WiredTiger.turtle
-rw-------   1 rhp  staff    21 Feb  7 18:05 WiredTiger.lock
-rw-------   1 rhp  staff    45 Feb  

## Starting a MongoDB Instance for the Lectures

  * Via a container, see https://hub.docker.com/_/mongo/:
  ```bash
  docker run --rm --publish=27017:27017 --name dbms -d mongo:latest
  ```
  ```bash
  docker run --rm -v $(pwd)/data:/data/db --publish=27017:27017 --name dbms -d mongo:latest
  ```
  * Installation in the provided VM, see https://github.com/datsoftlyngby/soft2019spring-databases
  * Installation of MongoDB on the host machine, see https://docs.mongodb.com/manual/administration/install-community/



## Connecting to MongoDB

  * Via a Mongo shell installed in a container: 
  ```bash
  docker run -it --link dbms:mongo --rm mongo sh -c 'exec mongo "$MONGO_PORT_27017_TCP_ADDR:$MONGO_PORT_27017_TCP_PORT/test"'
  ```
  * Via the Mongo shell installed on a host: 
  ```bash
  mongo --host 127.0.0.1:27017
  ```
  * Via the GUI client RoboMongo, see https://robomongo.org/download
![](https://robomongo.org/static/screens-transparent-6e2a44fd.png)
  * Via your own application, see the end of the lecture.

# The MongoDB Query Language

  > MongoDB queries are represented as a JSON-like structure, just like documents. To build a query, you specify a document with properties you wish the results to match. MongoDB treats each property as having an implicit boolean AND. It natively supports boolean OR queries, but you must use a special operator ($or) to achieve it. In addition to exact matches, MongoDB has operators for greater than, less than, etc.
  >
  > https://www.safaribooksonline.com/library/view/mongodb-and-python/9781449312817/ch02s06.html
  
The MongoDB documentation calls the Query Language itself _Query Documents_, see https://docs.mongodb.com/manual/tutorial/query-documents/.

## Hey, I think I know everything you want to tell us here! 

![](http://static3.businessinsider.com/image/4fbfb86becad044879000001-506-253/suddenly-startups-have-gotten-very-boring.jpg)

Cool! Then I would like to ask for your help. I would like those who know Mongo well to check out MySQL 8.

In particular:

  * Figure out how to use the JSON support in MySQL 8.0
  * Figure out how to do JSON queries
  * Figure out how to do queries that merge SQL and JSON

See:
  * https://mysqlserverteam.com/json_table-the-best-of-both-worlds/
  


## Switching to a collection / Creating a new collection

In [2]:
import pymongo
from pymongo import MongoClient
client = MongoClient()
db = client.testDB
users = db.users
print("Done")

Done


## Select All Documents in a Collection

To select all documents in the collection, pass an empty document as the query filter parameter to the `find` method.

In [19]:
import pprint

def pp(obj):
    pprint.pprint(obj)
    
def ppall(col):
    for p in col:
        pp( p )
ppall( users.find({} ) )

{'_id': ObjectId('5c594dea3d356cc77adcbd71'),
 'age': 22,
 'favorites': {'cafe': ['Vaffelbageren', 'Café BoPa', 'Conditori La Glace']},
 'username': 'Hansen'}
{'_id': ObjectId('5c594e163d356cc77adcbd72'),
 'age': 24,
 'country': 'Denmark',
 'username': 'Nielsen'}
{'_id': ObjectId('5c5954553d356cc77adcbd73'), 'country': 'Canada'}


```bash
db.users.find({})
``` 

is equivalent to `db.users.find()`. 

However, the former is more explicit and preferred.

In SQL, the above query corresponds to


```sql
SELECT * FROM users
```

## Inserting Data

To be able to query some data in the following, let's first create some in the database.

In [86]:
res = users.insert_one({"username": "Møller", "age": 25})
pp( res )

<pymongo.results.InsertOneResult object at 0x1081a97c8>


In [28]:
ppall( users.find({}) )

## Deleting Documents

### Delete Documents Matching a Filter

In [27]:
res = users.delete_many({"username": "Møller"})
print("Deleted: " + str(res.deleted_count) )

Deleted: 6


### Delete all Documents of a Collection

In [37]:
users.delete_many({}).deleted_count

0

In [40]:
ppall( users.find({}) )

### Deleting an Entire Collection

In [45]:
db.users.drop()

### What is the _id field?

The `_id` value can be considered a document’s primary key. Every MongoDB document requires an `_id`.

If none is present at creation time, a special MongoDB ObjectID will be generated and added to the document.

In [46]:
res = db.users.insert_one({"username": "Hansen", "age": 22})
res.inserted_id

ObjectId('5c594dea3d356cc77adcbd71')
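
An ObjectId is not an arbitrary number: its first four bytes hold the creation time in seconds since the Unix epoch. The following sketch decodes that timestamp from the hex string shown above, using only the standard library. The helper name `objectid_timestamp` is made up for this example and is not part of PyMongo.

```python
from datetime import datetime, timezone

def objectid_timestamp(hex_id: str) -> datetime:
    """Decode the creation time stored in the first 4 bytes (8 hex digits)."""
    seconds = int(hex_id[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

created = objectid_timestamp("5c594dea3d356cc77adcbd71")
print(created.isoformat())
```

When working with PyMongo's own `ObjectId` objects, the same information is available through the `generation_time` attribute.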

In [48]:
db.users.insert_one({"username": "Nielsen", "age": 24})

<pymongo.results.InsertOneResult at 0x108149f88>

In [70]:
ppall( db.users.find({}) )

{'_id': ObjectId('5c594dea3d356cc77adcbd71'), 'age': 22, 'username': 'Hansen'}
{'_id': ObjectId('5c594e163d356cc77adcbd72'), 'age': 24, 'username': 'Nielsen'}


In [59]:
users.count_documents({})

2

## Query Documents in a Collection

### Matching Selector

A query selector is a document that is used to match against all documents in the collection. 

It specifies the _equality condition_, i.e. fields and values, which must be equal in the documents you want to select.

To specify equality conditions, use `<field>:<value>` expressions in the query selector document:

```
{ <field1>: <value1>, ... }
```

In [64]:
res = users.find({"username": "Hansen"})
ppall( res )

{'_id': ObjectId('5c594dea3d356cc77adcbd71'), 'age': 22, 'username': 'Hansen'}


That query is equivalent to the following SQL query:

```SQL
SELECT * FROM users WHERE username = 'Hansen'
```

In [66]:
res = db.users.find({ "username": "Hansen",
                "age" : 22 })
ppall( res )

{'_id': ObjectId('5c594dea3d356cc77adcbd71'), 'age': 22, 'username': 'Hansen'}


A matching selector with several fields and values:

```javascript
db.users.find({ username: "Hansen",
                "age" : 22 })
```

is equivalent to the following query with an explicit conjunction (`$and`).

In [68]:
ppall( db.users.find({ "$and": [ { "username": "Hansen" },
                        { "age": 22 } ] }) )

{'_id': ObjectId('5c594dea3d356cc77adcbd71'), 'age': 22, 'username': 'Hansen'}


In [71]:
ppall( db.users.find({ "$or": [ { "username": "Nielsen" }, 
                       { "username": "Hansen" } ]}) )

{'_id': ObjectId('5c594dea3d356cc77adcbd71'), 'age': 22, 'username': 'Hansen'}
{'_id': ObjectId('5c594e163d356cc77adcbd72'), 'age': 24, 'username': 'Nielsen'}
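
The semantics of the implicit conjunction, `$and`, and `$or` can be sketched in a few lines of plain Python. This is a toy model for building intuition, not how MongoDB evaluates queries; the function name `matches` and the in-memory `users` list are assumptions made for the example.

```python
def matches(doc, selector):
    """Return True if doc satisfies a selector built from equality, $and and $or."""
    for key, expected in selector.items():
        if key == "$and":
            # every sub-selector must match
            if not all(matches(doc, sub) for sub in expected):
                return False
        elif key == "$or":
            # at least one sub-selector must match
            if not any(matches(doc, sub) for sub in expected):
                return False
        elif key not in doc or doc[key] != expected:
            # plain field: equality condition, combined with an implicit AND
            return False
    return True

users = [
    {"username": "Hansen", "age": 22},
    {"username": "Nielsen", "age": 24},
]

implicit_and = [u for u in users if matches(u, {"username": "Hansen", "age": 22})]
explicit_and = [u for u in users
                if matches(u, {"$and": [{"username": "Hansen"}, {"age": 22}]})]
either = [u for u in users
          if matches(u, {"$or": [{"username": "Nielsen"}, {"username": "Hansen"}]})]
print(implicit_and == explicit_and, len(either))
```

Listing several fields in one selector and wrapping them in `$and` select the same documents, which is why the explicit form is rarely needed for simple equality conditions.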


## The `$in` Operator: A More Compact Containment Check

In [72]:
ppall( db.users.find( 
    { "username": { "$in": [ "Nielsen", "Hansen" ] } } ) )

{'_id': ObjectId('5c594dea3d356cc77adcbd71'), 'age': 22, 'username': 'Hansen'}
{'_id': ObjectId('5c594e163d356cc77adcbd72'), 'age': 24, 'username': 'Nielsen'}


## Regular Expressions in Matching Queries

You can use regular expressions in your matching queries in either of the following two forms:

```
db.users.find({ username: /en$/ })
db.users.find({ username: { $regex: "en$" } })
```
(Note: the first form relies on the regex literal syntax of the host language, here JavaScript; the second form is generic.)

In [77]:
ppall( users.find({ "username": {"$regex": "en$"} }) )

{'_id': ObjectId('5c594dea3d356cc77adcbd71'), 'age': 22, 'username': 'Hansen'}
{'_id': ObjectId('5c594e163d356cc77adcbd72'), 'age': 24, 'username': 'Nielsen'}


## Range Queries

The Mongo query language supports the following comparison operators: `$eq`, `$gt`, `$gte`, `$in`, `$lt`, `$lte`, `$ne`, `$nin`

In [78]:
ppall( users.find( { "username": {"$regex": "en$"}, 
                 "age": { "$lte": 24 } } ))

{'_id': ObjectId('5c594dea3d356cc77adcbd71'), 'age': 22, 'username': 'Hansen'}
{'_id': ObjectId('5c594e163d356cc77adcbd72'), 'age': 24, 'username': 'Nielsen'}
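
A rough way to think about these operators is as a lookup table from operator name to comparison function. The sketch below models that idea in plain Python; `COMPARATORS`, `field_matches`, and `find` are hypothetical names, and the model deliberately ignores details of the real server (for example, how missing fields are treated).

```python
import operator
import re

# Operator name -> function(value_in_document, operator_argument)
COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
    "$regex": lambda value, pattern: re.search(pattern, value) is not None,
}

def field_matches(value, condition):
    """A plain value means equality; a dict means every listed operator must hold."""
    if not isinstance(condition, dict):
        return value == condition
    return all(COMPARATORS[op](value, arg) for op, arg in condition.items())

def find(docs, selector):
    """Return the documents in which every selected field satisfies its condition."""
    return [d for d in docs
            if all(k in d and field_matches(d[k], cond) for k, cond in selector.items())]

users = [
    {"username": "Hansen", "age": 22},
    {"username": "Nielsen", "age": 24},
    {"username": "Møller", "age": 25},
]
young_en = find(users, {"username": {"$regex": "en$"}, "age": {"$lte": 24}})
in_range = find(users, {"age": {"$gt": 22, "$lt": 25}})
print([u["username"] for u in young_en], [u["username"] for u in in_range])
```

Several operators on the same field, as in `{"$gt": 22, "$lt": 25}`, are combined with AND, which is how a range is expressed.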


## Updating Documents

Generally, there are two types of updates with different semantics and use cases:

  * Updating a single document or many documents, i.e., modification of corresponding fields and values.
  * Replacement of old documents with new ones.
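
The difference between the two update types can be sketched with plain dictionaries. This is a simplified model under stated assumptions: the function name `apply_update` is made up, only `$set` and `$unset` are modeled, and the `_id` value `1` is a placeholder.

```python
import copy

def apply_update(doc, update):
    """Return a new document after an operator update or a replacement."""
    if any(key.startswith("$") for key in update):
        # Operator update: modify only the named fields, keep everything else.
        result = copy.deepcopy(doc)
        for field, value in update.get("$set", {}).items():
            result[field] = value
        for field in update.get("$unset", {}):
            result.pop(field, None)
        return result
    # Replacement: only _id survives, everything else comes from the new document.
    replaced = {"_id": doc["_id"]}
    replaced.update(copy.deepcopy(update))
    return replaced

original = {"_id": 1, "username": "Møller", "age": 25}
after_set = apply_update(original, {"$set": {"country": "Canada"}})
after_replace = apply_update(original, {"country": "Canada"})
after_unset = apply_update(after_set, {"$unset": {"country": 1}})
print(after_set)
print(after_replace)
```

Forgetting the `$set` wrapper therefore silently drops all other fields of the document, which is a common source of data loss.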
  


### Operator Update

In [82]:
res = users.update( { "username": "Nielsen" }, 
                 { "$set": { "country": "Denmark" } } )
pp( res )

{'n': 0, 'nModified': 0, 'ok': 1.0, 'updatedExisting': False}

Note: `Collection.update` is PyMongo's legacy update method. It was deprecated in PyMongo 3.0 and removed in PyMongo 4.0; newer code should use `update_one`, `update_many`, or `replace_one` instead.


  


In [118]:
res = users.update_one( { "username": "Nielsen" }, 
                 { "$set": { "country": "Denmark" } } )
res.raw_result

{'n': 1, 'nModified': 0, 'ok': 1.0, 'updatedExisting': True}

In [87]:
ppall( db.users.find({"username": "Møller"}) )

{'_id': ObjectId('5c5954553d356cc77adcbd73'), 'age': 25, 'username': 'Møller'}


### Replacement Update

In [120]:
ppall( users.update( { "username": "Møller" }, 
                 { "country": "Canada" } ))

'n'
'nModified'
'ok'
'updatedExisting'

(The legacy `update` method returns a plain dictionary. Because `ppall` iterates over its argument, only the dictionary keys are printed here.)


  


In [95]:
ppall (db.users.find( {} ) )

{'_id': ObjectId('5c594dea3d356cc77adcbd71'), 'age': 22, 'username': 'Hansen'}
{'_id': ObjectId('5c594e163d356cc77adcbd72'), 'age': 24, 'username': 'Nielsen'}
{'_id': ObjectId('5c5954553d356cc77adcbd73'), 'country': 'Canada'}


In [97]:
ppall( users.find( { "country": "Canada" } ))

{'_id': ObjectId('5c5954553d356cc77adcbd73'), 'country': 'Canada'}


Let's add the username back to the record for our example.

In [100]:
res = db.users.update( { "country": "Canada" }, 
                 { "$set": { "username": "Møller" } } )
ppall(res)

'n'
'nModified'
'ok'
'updatedExisting'


  


In [101]:
ppall( db.users.find( { "country": "Canada" } ) )

{'_id': ObjectId('5c5954553d356cc77adcbd73'),
 'country': 'Canada',
 'username': 'Møller'}


### Removing a field from a document

A field can be removed with the help of the `$unset` operator.

In [102]:
res = db.users.update( { "username": "Møller" }, 
                 { "$unset": { "country": 1 } } )
ppall(res)

'n'
'nModified'
'ok'
'updatedExisting'


  


In [103]:
ppall( db.users.find( { "username": "Møller" } ) )

{'_id': ObjectId('5c5954553d356cc77adcbd73'), 'username': 'Møller'}


### Complex Updates

In [104]:
ppall( db.users.update( { "username": "Møller" }, 
                 {  "$set": {
                      "favorites": { 
                        "restaurant": [ "La Petanque", "Hija de Sanchez" ], 
                        "cafe": [ "Paludan Bog & Café", "Café Retro", "Conditori La Glace" ] 
                      }
                    } 
                  }))

'n'
'nModified'
'ok'
'updatedExisting'



In [105]:
ppall(db.users.update( { "username": "Hansen" }, 
                 {  "$set": {
                      "favorites": { 
                        "cafe": [ "Vaffelbageren", "Café BoPa", "Conditori La Glace" ] 
                      }
                    } 
                 }))

'n'
'nModified'
'ok'
'updatedExisting'




In [106]:
ppall(db.users.find({}))

{'_id': ObjectId('5c594dea3d356cc77adcbd71'),
 'age': 22,
 'favorites': {'cafe': ['Vaffelbageren', 'Café BoPa', 'Conditori La Glace']},
 'username': 'Hansen'}
{'_id': ObjectId('5c594e163d356cc77adcbd72'), 'age': 24, 'username': 'Nielsen'}
{'_id': ObjectId('5c5954553d356cc77adcbd73'),
 'favorites': {'cafe': ['Paludan Bog & Café',
                        'Café Retro',
                        'Conditori La Glace'],
               'restaurant': ['La Petanque', 'Hija de Sanchez']},
 'username': 'Møller'}


In [111]:
ppall(users.find( { "favorites.cafe": "Conditori La Glace" } ))

{'_id': ObjectId('5c594dea3d356cc77adcbd71'),
 'age': 22,
 'favorites': {'cafe': ['Vaffelbageren', 'Café BoPa', 'Conditori La Glace']},
 'username': 'Hansen'}
{'_id': ObjectId('5c5954553d356cc77adcbd73'),
 'favorites': {'cafe': ['Paludan Bog & Café',
                        'Café Retro',
                        'Conditori La Glace',
                        'Lagkagehuset'],
               'restaurant': ['La Petanque', 'Hija de Sanchez']},
 'username': 'Møller'}


#### Adding Elements to Nested Sets

Suppose we know that any user who likes _Café Retro_ also likes _Lagkagehuset_ and that our database shall reflect this fact.

##### What do the `false` and the `true` mean?

```javascript
> db.users.update
function (query, obj, upsert, multi) {
...
}
```
(`Upsert` - An operation that inserts rows into a database table if they do not already exist, or updates them if they do.)

In [110]:
ppall( db.users.update( { "favorites.cafe": "Café Retro" }, 
                 { "$addToSet": { "favorites.cafe": "Lagkagehuset" } }, 
                 False, True))

'n'
'nModified'
'ok'
'updatedExisting'
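
The behaviour of this call can be modeled in plain Python. The sketch below covers three ideas: resolving a dotted path such as `favorites.cafe`, matching a value inside an array, and the set semantics of `$addToSet`. The `multi` parameter corresponds to the second positional boolean in the signature shown above; `upsert` is not modeled. The names `get_path` and `add_to_set` and the sample data are assumptions for illustration only.

```python
def get_path(doc, dotted):
    """Follow a dotted path such as 'favorites.cafe'; return None if absent."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

def add_to_set(docs, path, contains, new_value, multi):
    """Add new_value to the array at path in documents whose array holds `contains`."""
    modified = 0
    for doc in docs:
        array = get_path(doc, path)
        if isinstance(array, list) and contains in array:
            if new_value not in array:   # set semantics: never add a duplicate
                array.append(new_value)
                modified += 1
            if not multi:                # multi=False stops after the first match
                break
    return modified

users = [
    {"username": "Hansen", "favorites": {"cafe": ["Vaffelbageren", "Café BoPa"]}},
    {"username": "Møller", "favorites": {"cafe": ["Café Retro", "Conditori La Glace"]}},
]
first_run = add_to_set(users, "favorites.cafe", "Café Retro", "Lagkagehuset", multi=True)
second_run = add_to_set(users, "favorites.cafe", "Café Retro", "Lagkagehuset", multi=True)
print(first_run, second_run, users[1]["favorites"]["cafe"])
```

Running the same update twice modifies nothing the second time, which is the practical difference between `$addToSet` and a plain append.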




# Querying with Indexes

Let's start with creating a large collection of numbers.

You know Javascript. How do I specify a loop from `0` to `20000` in Javascript?

In [127]:
res = db.numbers.drop()
pp(res)

None


In [128]:
for i in range(0, 20000):
    db.numbers.insert_one({"num": i})

In [125]:
db.numbers.count_documents({})

20000

In [135]:
res = db.numbers.find( {} )
ppall( res.limit(5) )

{'_id': ObjectId('5c59610c3d356cc77add0b94'), 'num': 0}
{'_id': ObjectId('5c59610c3d356cc77add0b95'), 'num': 1}
{'_id': ObjectId('5c59610c3d356cc77add0b96'), 'num': 2}
{'_id': ObjectId('5c59610c3d356cc77add0b97'), 'num': 3}
{'_id': ObjectId('5c59610c3d356cc77add0b98'), 'num': 4}


In [137]:
ppall( db.numbers.find( { "num": { "$gt": 20, "$lt": 25 } } ) )

{'_id': ObjectId('5c59610c3d356cc77add0ba9'), 'num': 21}
{'_id': ObjectId('5c59610c3d356cc77add0baa'), 'num': 22}
{'_id': ObjectId('5c59610c3d356cc77add0bab'), 'num': 23}
{'_id': ObjectId('5c59610c3d356cc77add0bac'), 'num': 24}


## Execution Statistics

When any database receives a query, it must plan out how to execute it. This is called a _query plan_.

The `explain` method describes query paths and allows developers to diagnose slow operations by determining which indexes a query has used.

In [141]:
db.numbers.find( { 
                   "num": { 
                     "$gt": 19995 
                   } 
                 } ).explain()["executionStats"]

{'executionSuccess': True,
 'nReturned': 4,
 'executionTimeMillis': 10,
 'totalKeysExamined': 0,
 'totalDocsExamined': 20000,
 'executionStages': {'stage': 'COLLSCAN',
  'filter': {'num': {'$gt': 19995}},
  'nReturned': 4,
  'executionTimeMillisEstimate': 0,
  'works': 20002,
  'advanced': 4,
  'needTime': 19997,
  'needYield': 0,
  'saveState': 156,
  'restoreState': 156,
  'isEOF': 1,
  'invalidates': 0,
  'direction': 'forward',
  'docsExamined': 20000},
 'allPlansExecution': []}

## Creating an Index in MongoDB

In addition to the indexes that you create manually, MongoDB automatically creates an index on the `_id` field of every collection.

MongoDB indexes use a _B-tree_ data structure. We will return to this data structure later.
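
Why an index helps can be illustrated without a database. A collection scan (`COLLSCAN` in the output above) has to look at every document, whereas an ordered index lets the engine jump to the first qualifying key. In the sketch below a sorted Python list with `bisect` stands in for the B-tree; this is an analogy under that assumption, not MongoDB's implementation, and the function names are made up.

```python
from bisect import bisect_right

nums = list(range(20000))  # the values of the "num" field, already sorted

def collection_scan(values, lower):
    """Check every value, like a COLLSCAN; count how many were examined."""
    examined = 0
    hits = []
    for v in values:
        examined += 1
        if v > lower:
            hits.append(v)
    return hits, examined

def index_scan(sorted_keys, lower):
    """Binary-search to the first key > lower, then read only the matching keys."""
    start = bisect_right(sorted_keys, lower)
    hits = sorted_keys[start:]
    return hits, len(hits)

scan_hits, scan_examined = collection_scan(nums, 19995)
index_hits, index_examined = index_scan(nums, 19995)
print(scan_examined, index_examined)
```

The two counters play the role of `totalDocsExamined` and `totalKeysExamined` in the `explain` output: the result is the same, but the amount of work differs by orders of magnitude.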

In [145]:
db.numbers.create_index( "num" )

'num_1'

In [147]:
ppall( db.numbers.list_indexes())  # PyMongo counterpart of getIndexes() in the mongo shell

{'key': SON([('_id', 1)]),
 'name': '_id_',
 'ns': 'testDB.numbers',
 'v': 2}
{'key': SON([('num', 1)]),
 'name': 'num_1',
 'ns': 'testDB.numbers',
 'v': 2}


In [148]:
db.numbers.find( { 
                   "num": { 
                     "$gt": 19995 
                   } 
                 } ).explain()["executionStats"]

{'executionSuccess': True,
 'nReturned': 4,
 'executionTimeMillis': 5,
 'totalKeysExamined': 4,
 'totalDocsExamined': 4,
 'executionStages': {'stage': 'FETCH',
  'nReturned': 4,
  'executionTimeMillisEstimate': 0,
  'works': 5,
  'advanced': 4,
  'needTime': 0,
  'needYield': 0,
  'saveState': 0,
  'restoreState': 0,
  'isEOF': 1,
  'invalidates': 0,
  'docsExamined': 4,
  'alreadyHasObj': 0,
  'inputStage': {'stage': 'IXSCAN',
   'nReturned': 4,
   'executionTimeMillisEstimate': 0,
   'works': 5,
   'advanced': 4,
   'needTime': 0,
   'needYield': 0,
   'saveState': 0,
   'restoreState': 0,
   'isEOF': 1,
   'invalidates': 0,
   'keyPattern': {'num': 1},
   'indexName': 'num_1',
   'isMultiKey': False,
   'multiKeyPaths': {'num': []},
   'isUnique': False,
   'isSparse': False,
   'isPartial': False,
   'indexVersion': 2,
   'direction': 'forward',
   'indexBounds': {'num': ['(19995, inf.0]']},
   'keysExamined': 4,
   'seeks': 1,
   'dupsTested': 0,
   'dupsDropped': 0,
   'seenInval

In [149]:
for i in range(20000, 50000):
    db.numbers.insert_one({"num": i})
    

In [153]:
ppall(db.numbers.find( { 
                   "num": { 
                     "$gt": 49995 
                   } 
                 } ))

{'_id': ObjectId('5c59679c3d356cc77addcee0'), 'num': 49996}
{'_id': ObjectId('5c59679c3d356cc77addcee1'), 'num': 49997}
{'_id': ObjectId('5c59679c3d356cc77addcee2'), 'num': 49998}
{'_id': ObjectId('5c59679c3d356cc77addcee3'), 'num': 49999}


# Connect to MongoDB from a Java Maven Project

https://mongodb.github.io/mongo-java-driver/3.0/driver/getting-started/installation-guide/

```xml
<dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongodb-driver</artifactId>
    <version>3.9.1</version>
</dependency>
```

Based on https://mongodb.github.io/mongo-java-driver/3.0/driver/getting-started/quick-tour/

```java
package dk.cphbusiness.db.meassurements;

import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class MongoTest {

    public static void main(String[] args) {
        MongoClientURI connStr = new MongoClientURI("mongodb://localhost:27017");
        MongoClient mongoClient = new MongoClient(connStr);
        try {
            MongoDatabase db = mongoClient.getDatabase("test-database");
            MongoCollection<Document> collection = db.getCollection("tweets");

            // first() returns null if the collection is empty
            Document myDoc = collection.find().first();
            System.out.println(myDoc == null ? "No documents found" : myDoc.toJson());
        } finally {
            // release the connections held by the client
            mongoClient.close();
        }
    }
}
```

# Importing Data

You can either write a program that inserts documents into a database or use MongoDB's CLI import tool, `mongoimport`.

```bash
mongoimport --drop --db social_net --collection tweets --type csv --headerline --file testdata.manual.2009.06.14.csv
```


In [155]:
%%bash
mongoimport --help

bash: line 1: mongoimport: command not found


In [13]:
db.books.drop()
"all dropped"

'all dropped'

In [14]:
from urllib.request import urlopen
import json
from bson.json_util import loads

link = "https://raw.githubusercontent.com/ozlerhakan/mongodb-json-files/master/datasets/catalog.books.json"
f = urlopen(link)
myfile = f.read()
allBooks = myfile.decode("utf-8")
count = 0
for line in allBooks.splitlines():
    jsonbook = loads(line)
    #print( str(count) +": " + str(jsonbook) )
    db.books.insert_one(jsonbook)
    count = count + 1
db.books.count_documents({})

431

In [23]:
ppall(db.books.find( {"title": {"$regex": "Android"}}, 
                    {"title":1, "authors":1, "_id":0}) ) # project 

{'authors': ['W. Frank Ableson', 'Charlie Collins', 'Robi Sen'],
 'title': 'Unlocking Android'}
{'authors': ['W. Frank Ableson', 'Robi Sen'],
 'title': 'Android in Action, Second Edition'}
{'authors': ['Charlie Collins', 'Michael D. Galpin', '', 'Matthias Kaeppler'],
 'title': 'Android in Practice'}
{'authors': ['Matthias Kaeppler', 'Michael D. Galpin', 'Charlie Collins'],
 'title': 'Android in Practice'}
{'authors': ['W. Frank Ableson', 'Robi Sen', 'Chris King', 'C. Enrique Ortiz'],
 'title': 'Android in Action, Third Edition'}
{'authors': ['Carlos M. Sessa'], 'title': '50 Android Hacks'}


# Your turn at home!


## Assignment 2 - Analysis of Twitter Data

Your task is to implement a small database application, which imports a dataset of Twitter tweets from a CSV file into a database.

Your application has to be able to answer queries corresponding to the following questions:

  1. How many Twitter users are in the database?
  2. Which Twitter users link the most to other Twitter users? (Provide the top ten.)
  3. Who are the most mentioned Twitter users? (Provide the top five.)
  4. Who are the most active Twitter users? (Provide the top ten.)
  5. Who are the five most grumpy (most negative tweets) and the five most happy (most positive tweets) Twitter users? (Provide five users for each group.)

Your application can be written in the language of your choice. It must have some form of UI, but it does not matter whether it is an API, a CLI, a GUI, or a web-based UI.

You present your system's answers to the questions above in a Markdown file on your Github account. That is, you hand in this assignment via Github, with one hand-in per group. Push your solution (source code and presentation of the results) to one Github repository per group and post a link to your solution in the hand-in area.


## Hints

You can download and uncompress a dataset of Twitter tweets from http://help.sentiment140.com/for-students/.


Connect to your Docker container running the MongoDB DBMS.

```bash
$ docker run --rm -v $(pwd)/data:/data/db --publish=27017:27017 --name dbms -d mongo
88385afac5fe88a5ba47cd60c084bc1855cae6089a7e7d95ba24f0ba6fea1404
$ docker exec -it 88385afa bash
```

On that container, install `wget` and `unzip`. **Note**: everything from here on is a generic description of how to work on a Linux distribution with the `apt` package manager, such as Ubuntu or Debian.

```bash
root@88385afac5fe:/$ apt-get update
root@88385afac5fe:/$ apt-get install -y wget unzip
```

Continue by downloading the data:

```bash
root@88385afac5fe:/$ wget http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
```

In your VM the unzip package is not installed by default. 

Now you can uncompress the Twitter dataset to your current directory with:

```bash
root@88385afac5fe:/$ unzip trainingandtestdata.zip
```

After uncompressing, you will have a folder with two files. For your exercise you will use the bigger one, `training.1600000.processed.noemoticon.csv`. Neither CSV file contains a header row. The documentation at http://help.sentiment140.com/for-students/ describes the columns as follows:

  > The data is a CSV with emoticons removed. Data file format has 6 fields:
  > 0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
  > 1 - the id of the tweet (2087)
  > 2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
  > 3 - the query (lyx). If there is no query, then this value is NO_QUERY.
  > 4 - the user that tweeted (robotickilldozr)
  > 5 - the text of the tweet (Lyx is cool)

To make use of the `--headerline` switch when importing the data with `mongoimport`, we add a header line accordingly:

```bash
root@88385afac5fe:/# sed -i '1s;^;polarity,id,date,query,user,text\n;' training.1600000.processed.noemoticon.csv
```
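If you import the data with your own program instead, the same idea applies: a header line tells the CSV reader how to name the fields. A minimal Python sketch with the standard `csv` module is shown here. The single sample row is built from the example values in the format description above; the polarity value `"4"` is an assumption made for this sketch.

```python
import csv
import io

HEADER = "polarity,id,date,query,user,text\n"

# One illustrative row; the real file has 1.6 million of them.
raw_csv = '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'

# Prepending the header is what the sed command does to the real file.
with_header = HEADER + raw_csv

# DictReader takes the field names from the first line.
rows = list(csv.DictReader(io.StringIO(with_header)))
print(rows[0]["user"], "->", rows[0]["text"])
```

Each row becomes a dictionary keyed by the header fields, which can be passed more or less directly to an insert call. Note that every value is still a string at this point.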

As a result, the fields of the imported documents are named after the header fields.
After importing the dataset, the dates are represented as strings instead of proper date objects. You might want to convert them with the following code:

```javascript
db.tweets.find().forEach(function(doc){
    if (doc.date instanceof Date !== true) {
        doc.date = new Date(doc.date);
        db.tweets.save(doc);
    }
});
```
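If you import the tweets with a Python program, the conversion can happen before the documents are inserted. The following sketch parses the date format given in the dataset description. The helper name `parse_tweet_date` is made up, and dropping the time zone token is a simplification: the result is a naive `datetime`, and the month and weekday names are assumed to be in English (the default C locale).

```python
from datetime import datetime

def parse_tweet_date(raw: str) -> datetime:
    """Parse e.g. 'Sat May 16 23:58:44 UTC 2009', ignoring the time zone token."""
    parts = raw.split()
    # parts: [weekday, month, day, time, zone, year]; drop the zone at index 4
    without_zone = " ".join(parts[:4] + parts[5:])
    return datetime.strptime(without_zone, "%a %b %d %H:%M:%S %Y")

parsed = parse_tweet_date("Sat May 16 23:58:44 UTC 2009")
print(parsed.isoformat())
```

PyMongo stores Python `datetime` values as BSON dates, so documents prepared this way do not need the conversion loop afterwards.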

For some of the questions, you might want to have a look into MongoDB's aggregation framework and queries using regular expressions. For example, the following query finds all tweets mentioning another Twitter user.

```
db.tweets.aggregate([
    {$match: {text: /@\w+/}},
    {$group: {_id: null, text: {$push: "$text"}}}
])
```
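The regular expression itself can be tried out in Python before it is used in a query. The sketch below uses a handful of made-up tweets (they are not from the dataset) to show how a pattern for `@username` finds the tweets with mentions and how the mentioned names could be counted on the client side.

```python
import re
from collections import Counter

# "@" followed by one or more word characters; the group captures the user name.
MENTION = re.compile(r"@(\w+)")

# Made-up sample tweets, only for illustrating the pattern.
tweets = [
    "@alice thanks for the link!",
    "Reading about MongoDB today",
    "@bob and @alice should see this",
]

mentioning = [t for t in tweets if MENTION.search(t)]
mention_counts = Counter(name for t in tweets for name in MENTION.findall(t))
print(len(mentioning), mention_counts.most_common(1))
```

For the full dataset, doing this counting inside the database with the aggregation framework is likely the better choice; the sketch only illustrates what the pattern matches.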


## Hand-in procedure

  * Provide all code and documentation for this assignment in a repository on Github.
  * Create a Markdown (.md) file called README.md in the root of your project.
  * That README.md describes what this project does and how to make it work. That is, your reviewer has to be able to clone your project, build it (you have to define the steps for how to do that and which dependencies are required), and use it.
  * A presentation of your system's reply to the **five** queries above.
  * Hand in a link to your repository on www.peergrade.io.
  * Hand in no later than 10 Feb., 16:00.
  
## Review procedure

  * Log onto www.peergrade.io with your school email address.
  * Finish your review on www.peergrade.io no later than 12 Feb., 23:00.
  * Make use of the review criteria in Peergrade when giving feedback.



# Video

  * https://www.youtube.com/watch?v=1sLjWlWvCsc

# Literature

  * http://www.redbook.io/pdf/redbook-5th-edition.pdf
  * http://15721.courses.cs.cmu.edu/spring2016/papers/whatgoesaround-stonebraker.pdf
  * https://docs.mongodb.com/manual/tutorial/query-documents/
  * https://docs.mongodb.com/manual/reference/
  
  * https://docs.mongodb.com/manual/indexes/
  * https://docs.mongodb.com/manual/reference/program/mongoimport/
