# Data Models


  > Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we think about the problem that we are solving.
  >
  > Martin Kleppmann, _Designing Data-Intensive Applications_

## What was the Data Model in the Last Lecture and Exercise?

## Information Management System (IMS) - Hierarchical Data Model

  > IBM Information Management System (IMS) is a joint hierarchical database and information management system...
  >
  > The first "IMS READY" message appeared on an IBM 2740 terminal in Downey, California, on 14 August 1968.
  > 
  > https://en.wikipedia.org/wiki/IBM_Information_Management_System
  
It is still in use and sold today: https://www.ibm.com/support/knowledgecenter/zosbasics/com.ibm.imsintro.doc.intro/ip0ind0011003584.htm

## Hierarchical Data Model

![](https://www.tutorialspoint.com/ims_db/images/random_processing.png)
https://www.tutorialspoint.com/ims_db/ims_db_dli_processing.htm

## The Document Data Model


A document is essentially a set of property names and their values. The values can be simple data types, such as strings, numbers, and dates. But these values can also be arrays and even other documents.

```javascript
{
  _id: ObjectID('4bd9e8e17cefd644108961bb'),     // _id field, primary key
  title: 'Adventures in Databases',
  url: 'http://example.com/databases.txt',
  author: 'msmith',
  vote_count: 20,
  tags: ['databases', 'mongodb', 'indexing'],    // Tags stored as array of strings
  image: {                                       // Attribute pointing to another document
    url: 'http://example.com/db.jpg',
    caption: 'A database.',
    type: 'jpg',
    size: 75381,
    data: 'Binary'
  },
  comments: [                                    // Comments stored as array of comment objects
    {
      user: 'bjones',
      text: 'Interesting article.'
    },
    {
      user: 'sverch',
      text: 'Color me skeptical!'
    }
  ]
}
```

Note that strict JSON requires double quotes around all property names and string values; only numbers, booleans, and `null` are written without quotes. The listing above uses JavaScript object notation, which is what the Mongo shell accepts. Internally, MongoDB stores documents in a format called _Binary JSON_, or **BSON**.
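
The difference between the two notations can be checked in any JavaScript runtime. The following is a minimal sketch, assuming Node.js or a browser console; the `post` object is a shortened copy of the listing above and not part of any MongoDB API.

```javascript
// A JavaScript object literal: property names may be unquoted,
// and strings may use single quotes.
const post = {
  title: 'Adventures in Databases',
  vote_count: 20,
  tags: ['databases', 'mongodb', 'indexing'],
  image: { url: 'http://example.com/db.jpg', size: 75381 }
};

// JSON.stringify emits strict JSON: double quotes around every
// property name and string value, numbers stay unquoted.
const json = JSON.stringify(post);
console.log(json);

// Parsing the JSON text yields an equivalent object again.
const roundTripped = JSON.parse(json);
console.log(roundTripped.tags[1]); // mongodb
```

The printed JSON text is what you would send over the wire or store in a `.json` file, whereas the object literal is only valid inside JavaScript code such as the Mongo shell.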

## The Document Data Model in MongoDB

On top of documents, MongoDB has the concept of _collections_. _Collections_ can be considered as grouped _documents_.

_Collections_ are similar to tables in the relational world.

The document-oriented data model naturally represents data in an aggregate form, allowing you to work with an object holistically.

## The Object-Relational Mismatch

![](http://www.agiledata.org/images/impedanceMismatchClassDiagram.gif)

![](http://www.agiledata.org/images/impedanceMismatchPDM.gif)

http://www.agiledata.org/essays/impedanceMismatch.html

# Architecture of the Database System for Teaching

## A Database Container

![](images/DB_containers_internal.png)

```bash
$ docker run --rm --publish=27017:27017 --name dbms -d mongo:latest
$ docker run -it --link dbms:mongo --rm mongo sh -c 'exec mongo "$MONGO_PORT_27017_TCP_ADDR:$MONGO_PORT_27017_TCP_PORT/test"'
```


**Note:** Do not do this in production. We use this setup for experimentation only!

## A Database Container

What is the issue with such a setup?

## Containerized DB Setup for Production

![](images/DB_containers_external.png)


```bash
$ docker run --rm -v $(pwd)/data:/data/db --publish=27017:27017 --name dbms -d mongo:latest
$ docker run -it --link dbms:mongo --rm mongo sh -c 'exec mongo "$MONGO_PORT_27017_TCP_ADDR:$MONGO_PORT_27017_TCP_PORT/test"'
```

## Containerized DB Setup for Production


What is the advantage of such a setup?

In [1]:
%%bash
ls -ltra ../data

total 104
-rw-r--r--@  1 rhp  staff     0 Jan 23 10:20 .gitkeep
drwxr-xr-x  13 rhp  staff   442 Feb  2 15:39 ..
-rw-------   1 rhp  staff   114 Feb  7 18:05 storage.bson
-rw-------   1 rhp  staff  4096 Feb  7 18:05 sizeStorer.wt
-rw-------   1 rhp  staff     2 Feb  7 18:05 mongod.lock
drwx------   5 rhp  staff   170 Feb  7 18:05 journal
-rw-------   1 rhp  staff  4096 Feb  7 18:05 index-3--7174197220278716654.wt
-rw-------   1 rhp  staff  4096 Feb  7 18:05 index-1--7174197220278716654.wt
-rw-------   1 rhp  staff  4096 Feb  7 18:05 collection-2--7174197220278716654.wt
-rw-------   1 rhp  staff  4096 Feb  7 18:05 collection-0--7174197220278716654.wt
-rw-------   1 rhp  staff  4096 Feb  7 18:05 _mdb_catalog.wt
-rw-------   1 rhp  staff  4096 Feb  7 18:05 WiredTigerLAS.wt
-rw-------   1 rhp  staff  4096 Feb  7 18:05 WiredTiger.wt
-rw-------   1 rhp  staff   889 Feb  7 18:05 WiredTiger.turtle
-rw-------   1 rhp  staff    21 Feb  7 18:05 WiredTiger.lock
-rw-------   1 rhp  staff    45 Feb  

## Starting a MongoDB Instance for the Lectures

  * Via a container, see https://hub.docker.com/_/mongo/:
  ```bash
  docker run --rm --publish=27017:27017 --name dbms -d mongo:latest
  ```
  ```bash
  docker run --rm -v $(pwd)/data:/data/db --publish=27017:27017 --name dbms -d mongo:latest
  ```
  * Installation in the provided VM, see https://github.com/datsoftlyngby/soft2018spring-databases-teaching-material
  * Installation of MongoDB on the host machine, see https://docs.mongodb.com/manual/administration/install-community/



## Connecting to MongoDB

  * Via a Mongo shell installed in a container: 
  ```bash
  docker run -it --link dbms:mongo --rm mongo sh -c 'exec mongo "$MONGO_PORT_27017_TCP_ADDR:$MONGO_PORT_27017_TCP_PORT/test"'
  ```
  * Via the Mongo shell installed on a host: 
  ```bash
  mongo --host 127.0.0.1:27017
  ```
  * Via the GUI client RoboMongo, see https://robomongo.org/download
![](https://robomongo.org/static/screens-transparent-6e2a44fd.png)
  * Via your own application, see the end of the lecture.

# The MongoDB Query Language

  > MongoDB queries are represented as a JSON-like structure, just like documents. To build a query, you specify a document with properties you wish the results to match. MongoDB treats each property as having an implicit boolean AND. It natively supports boolean OR queries, but you must use a special operator ($or) to achieve it. In addition to exact matches, MongoDB has operators for greater than, less than, etc.
  >
  > https://www.safaribooksonline.com/library/view/mongodb-and-python/9781449312817/ch02s06.html
  
The MongoDB documentation calls the Query Language itself _Query Documents_, see https://docs.mongodb.com/manual/tutorial/query-documents/.

## Hey, I think I know everything you want to tell us here!

![](http://static3.businessinsider.com/image/4fbfb86becad044879000001-506-253/suddenly-startups-have-gotten-very-boring.jpg)

Cool! Then I would like to ask for your help. In particular, please help your fellow students by:

  * Figuring out how to do CRUD operations in CouchDB
  * Figuring out how to do range queries
  * Figuring out how to do queries over text content


Get the DBMS quickly:

```bash
docker run -d -p 5984:5984 --name couchdbms couchdb
```

See:
  
  * The official homepage https://couchdb.apache.org
  * An introductory tutorial http://docs.couchdb.org/en/2.1.1/intro/tour.html
  * The image in Dockerhub https://hub.docker.com/_/couchdb/

## Switching to a Database / Creating a New Database

In [None]:
use users

## Select All Documents in a Collection

To select all documents in the collection, pass an empty document as the query filter parameter to the find method.

In [2]:
db.users.find({})

```javascript
db.users.find({})
```

is equivalent to `db.users.find()`.

However, the former is more explicit and therefore preferred.

In SQL, the above query corresponds to


```sql
SELECT * FROM users
```

## Inserting Data

To be able to query some data in the following, let's first create some in the database.

In [3]:
db.users.insert({username: "Møller", age: 25})

WriteResult({ "nInserted" : 1 })

In [4]:
db.users.find({})

{ "_id" : ObjectId("5a7b31df0b1e5fccffb6e492"), "username" : "Møller", "age" : 25 }

### What is the _id field?

The `_id` value can be considered a document's primary key. Every MongoDB document requires an `_id`.

If none is present at creation time, MongoDB generates a special `ObjectId` and adds it to the document.
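
A generated `ObjectId` is not purely random: its first four bytes encode the creation time in seconds since the Unix epoch. The following minimal sketch decodes that timestamp in plain JavaScript. The helper `objectIdToDate` is a hypothetical name used only for this illustration; in the Mongo shell, `ObjectId(...).getTimestamp()` provides the same information.

```javascript
// Decode the creation time embedded in an ObjectId's hexadecimal string.
// The first 4 bytes (8 hex characters) are seconds since the Unix epoch.
function objectIdToDate(hex) {
  const seconds = parseInt(hex.substring(0, 8), 16);
  return new Date(seconds * 1000);
}

// The _id generated for the "Møller" document in this lecture.
const created = objectIdToDate('5a7b31df0b1e5fccffb6e492');
console.log(created.toISOString()); // 2018-02-07T17:05:35.000Z
```

One consequence is that sorting by `_id` roughly corresponds to sorting by insertion time.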

In [5]:
db.users.insert({username: "Hansen", age: 22})

WriteResult({ "nInserted" : 1 })

In [6]:
db.users.insert({username: "Nielsen", age: 24})

WriteResult({ "nInserted" : 1 })

In [7]:
db.users.find({})

{ "_id" : ObjectId("5a7b31df0b1e5fccffb6e492"), "username" : "Møller", "age" : 25 }
{ "_id" : ObjectId("5a7b31e50b1e5fccffb6e493"), "username" : "Hansen", "age" : 22 }
{ "_id" : ObjectId("5a7b31e70b1e5fccffb6e494"), "username" : "Nielsen", "age" : 24 }

In [8]:
db.users.count()

3

## Query Documents in a Collection

### Matching Selector

A query selector is a document that is used to match against all documents in the collection. 

It specifies the _equality condition_, i.e. fields and values, which must be equal in the documents you want to select.

To specify equality conditions, use `<field>:<value>` expressions in the query selector document:

```
{ <field1>: <value1>, ... }
```

In [9]:
db.users.find({username: "Hansen"})

{ "_id" : ObjectId("5a7b31e50b1e5fccffb6e493"), "username" : "Hansen", "age" : 22 }

That query is equivalent to the following SQL query:

```sql
SELECT * FROM users WHERE username = 'Hansen'
```

In [10]:
db.users.find({ username: "Hansen",
                "age" : 22 })

{ "_id" : ObjectId("5a7b31e50b1e5fccffb6e493"), "username" : "Hansen", "age" : 22 }

A matching selector with several fields and values:

```javascript
db.users.find({ username: "Hansen",
                "age" : 22 })
```

is equivalent to the following selector with an explicit conjunction (`$and`).

In [11]:
db.users.find({ $and: [ { username: "Hansen" },
                        { "age": 22 } ] })

{ "_id" : ObjectId("5a7b31e50b1e5fccffb6e493"), "username" : "Hansen", "age" : 22 }
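
The equivalence of the implicit and the explicit conjunction can be illustrated outside the database. The following is a minimal in-memory sketch in plain JavaScript, not MongoDB's implementation: the helper `matches` is hypothetical and supports only equality conditions on top-level fields and the `$and` operator.

```javascript
// Hypothetical helper: does `doc` satisfy `selector`?
// Supports only equality conditions and the $and operator.
function matches(doc, selector) {
  return Object.entries(selector).every(([key, condition]) => {
    if (key === '$and') {
      // Every sub-selector in the array has to match.
      return condition.every((sub) => matches(doc, sub));
    }
    // Plain equality on a top-level field.
    return doc[key] === condition;
  });
}

const users = [
  { username: 'Møller', age: 25 },
  { username: 'Hansen', age: 22 },
  { username: 'Nielsen', age: 24 }
];

// Several fields in one selector are combined with an implicit AND.
const implicit = users.filter((u) => matches(u, { username: 'Hansen', age: 22 }));
// The same condition with an explicit $and.
const explicit = users.filter((u) =>
  matches(u, { $and: [{ username: 'Hansen' }, { age: 22 }] })
);

console.log(implicit); // [ { username: 'Hansen', age: 22 } ]
console.log(explicit); // [ { username: 'Hansen', age: 22 } ]
```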

In [12]:
db.users.find({ $or: [ { username: "Møller" }, 
                       { username: "Hansen" } ]})

{ "_id" : ObjectId("5a7b31df0b1e5fccffb6e492"), "username" : "Møller", "age" : 25 }
{ "_id" : ObjectId("5a7b31e50b1e5fccffb6e493"), "username" : "Hansen", "age" : 22 }

## The `$or` Operator More Compactly as Containment Check

In [13]:
db.users.find( { username: { $in: [ "Møller", "Hansen" ] } } )

{ "_id" : ObjectId("5a7b31df0b1e5fccffb6e492"), "username" : "Møller", "age" : 25 }
{ "_id" : ObjectId("5a7b31e50b1e5fccffb6e493"), "username" : "Hansen", "age" : 22 }
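
Why `$or` over equality conditions on a single field and `$in` select the same documents can be seen in a small in-memory sketch. It uses plain JavaScript arrays as a stand-in for the collection and assumes nothing about how MongoDB evaluates the operators internally.

```javascript
const users = [
  { username: 'Møller', age: 25 },
  { username: 'Hansen', age: 22 },
  { username: 'Nielsen', age: 24 }
];

// $or: at least one of the listed conditions holds.
const viaOr = users.filter(
  (u) => u.username === 'Møller' || u.username === 'Hansen'
);

// $in: the field's value is contained in the given array.
const viaIn = users.filter((u) => ['Møller', 'Hansen'].includes(u.username));

console.log(viaOr.map((u) => u.username)); // [ 'Møller', 'Hansen' ]
console.log(viaIn.map((u) => u.username)); // [ 'Møller', 'Hansen' ]
```

Note that `$in` only replaces `$or` when all alternatives test the same field; conditions on different fields still need `$or`.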

## Regular Expressions in Matching Queries

You can use regular expressions in your matching queries in either of the following two forms:

```
db.users.find({ username: /en$/ })
db.users.find({ username: { $regex: "en$" } })
```

In [14]:
db.users.find({ username: /en$/ })

{ "_id" : ObjectId("5a7b31e50b1e5fccffb6e493"), "username" : "Hansen", "age" : 22 }
{ "_id" : ObjectId("5a7b31e70b1e5fccffb6e494"), "username" : "Nielsen", "age" : 24 }
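
The two forms correspond to a regular expression literal and a pattern given as a string. The following sketch shows the same distinction in plain JavaScript. Keep in mind that MongoDB evaluates `$regex` with Perl-compatible regular expressions, so this is an illustration of the pattern `en$`, not a statement that both engines behave identically for every pattern.

```javascript
const usernames = ['Møller', 'Hansen', 'Nielsen'];

// Regular expression literal, as in { username: /en$/ }.
const literal = /en$/;
// The same pattern built from a string, as in { $regex: "en$" }.
const fromString = new RegExp('en$');

console.log(usernames.filter((name) => literal.test(name)));    // [ 'Hansen', 'Nielsen' ]
console.log(usernames.filter((name) => fromString.test(name))); // [ 'Hansen', 'Nielsen' ]
```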

## Range Queries

The Mongo query language supports the following comparison operators: `$eq`, `$gt`, `$gte`, `$in`, `$lt`, `$lte`, `$ne`, and `$nin`.

In [15]:
db.users.find( { username: /en$/, 
                 age: { $lte: 24 } } )

{ "_id" : ObjectId("5a7b31e50b1e5fccffb6e493"), "username" : "Hansen", "age" : 22 }
{ "_id" : ObjectId("5a7b31e70b1e5fccffb6e494"), "username" : "Nielsen", "age" : 24 }
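
The meaning of the comparison operators can be summarized as plain JavaScript predicates. This is a simplified sketch under the assumption that the field holds a single number or string; it ignores arrays, missing fields, and MongoDB's rules for comparing values of different types. The names `comparators` and `satisfies` are hypothetical.

```javascript
// Hypothetical table mapping operator names to plain JavaScript predicates.
const comparators = {
  $eq: (value, arg) => value === arg,
  $ne: (value, arg) => value !== arg,
  $gt: (value, arg) => value > arg,
  $gte: (value, arg) => value >= arg,
  $lt: (value, arg) => value < arg,
  $lte: (value, arg) => value <= arg,
  $in: (value, arg) => arg.includes(value),
  $nin: (value, arg) => !arg.includes(value)
};

// All operators given for one field have to hold at the same time.
function satisfies(value, condition) {
  return Object.entries(condition).every(([op, arg]) => comparators[op](value, arg));
}

const ages = [25, 22, 24];
console.log(ages.filter((age) => satisfies(age, { $lte: 24 })));         // [ 22, 24 ]
console.log(ages.filter((age) => satisfies(age, { $gt: 22, $lt: 25 }))); // [ 24 ]
```

Combining two operators on the same field, as in the second call, is how a closed range is expressed.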

## Updating Documents

Generally, there are two types of updates with different semantics and use cases:

  * Updating a single document or many documents, i.e., modification of corresponding fields and values.
  * Replacement of old documents with new ones.
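
The difference between the two kinds of update can be sketched on a plain JavaScript object. The helpers `setFields` and `replaceDocument` are hypothetical and only mimic the visible effect; they do not handle dotted field paths or any other update operator, and the `_id` value `1` is made up for the example.

```javascript
// Operator update with $set: keep the document, overwrite only the given fields.
function setFields(doc, fields) {
  return { ...doc, ...fields };
}

// Replacement update: keep the _id, discard every other field.
function replaceDocument(doc, replacement) {
  return { _id: doc._id, ...replacement };
}

const user = { _id: 1, username: 'Møller', age: 25 };

console.log(setFields(user, { country: 'Denmark' }));
// { _id: 1, username: 'Møller', age: 25, country: 'Denmark' }

console.log(replaceDocument(user, { country: 'Canada' }));
// { _id: 1, country: 'Canada' }
```

Forgetting the `$set` operator therefore silently drops all fields that are not part of the new document.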
  


### Operator Update

In [16]:
db.users.update( { username: "Møller" }, 
                 { $set: { country: "Denmark" } } )

WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

In [None]:
db.users.find({username: "Møller"})

### Replacement Update

In [17]:
db.users.update( { username: "Møller" }, 
                 { country: "Canada" } )

WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

In [18]:
db.users.find( { username: "Møller" } )

In [19]:
db.users.find( { country: "Canada" } )

{ "_id" : ObjectId("5a7b31df0b1e5fccffb6e492"), "country" : "Canada" }

Let's add the username back to the record for our example.

In [20]:
db.users.update( { country: "Canada" }, 
                 { $set: { username: "Møller" } } )

WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

In [21]:
db.users.find( { country: "Canada" } )

{ "_id" : ObjectId("5a7b31df0b1e5fccffb6e492"), "country" : "Canada", "username" : "Møller" }

### Removing a field from a document

A field can be removed with the help of the `$unset` operator.

In [22]:
db.users.update( { username: "Møller" }, 
                 { $unset: { country: 1 } } )

WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

In [23]:
db.users.find( { username: "Møller" } )

{ "_id" : ObjectId("5a7b31df0b1e5fccffb6e492"), "username" : "Møller" }

### Complex Updates

In [24]:
db.users.update( { username: "Møller" }, 
                 {  $set: {
                      favorites: { 
                        restaurant: [ "La Petanque", "Hija de Sanchez" ], 
                        cafe: [ "Paludan Bog & Café", "Café Retro", "Conditori La Glace" ] 
                      }
                    } 
                  })

WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

In [25]:
db.users.update( { username: "Hansen" }, 
                 {  $set: {
                      favorites: { 
                        cafe: [ "Vaffelbageren", "Café BoPa", "Conditori La Glace" ] 
                      }
                    } 
                 })

WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

In [26]:
db.users.find({})

{ "_id" : ObjectId("5a7b31df0b1e5fccffb6e492"), "username" : "Møller", "favorites" : { "restaurant" : [ "La Petanque", "Hija de Sanchez" ], "cafe" : [ "Paludan Bog & Café", "Café Retro", "Conditori La Glace" ] } }
{ "_id" : ObjectId("5a7b31e50b1e5fccffb6e493"), "username" : "Hansen", "age" : 22, "favorites" : { "cafe" : [ "Vaffelbageren", "Café BoPa", "Conditori La Glace" ] } }
{ "_id" : ObjectId("5a7b31e70b1e5fccffb6e494"), "username" : "Nielsen", "age" : 24 }

In [27]:
db.users.find( { "favorites.cafe": "Conditori La Glace" } )

{ "_id" : ObjectId("5a7b31df0b1e5fccffb6e492"), "username" : "Møller", "favorites" : { "restaurant" : [ "La Petanque", "Hija de Sanchez" ], "cafe" : [ "Paludan Bog & Café", "Café Retro", "Conditori La Glace" ] } }
{ "_id" : ObjectId("5a7b31e50b1e5fccffb6e493"), "username" : "Hansen", "age" : 22, "favorites" : { "cafe" : [ "Vaffelbageren", "Café BoPa", "Conditori La Glace" ] } }

#### Adding Elements to Nested Sets

Suppose we know that any user who likes _Café Retro_ also likes _Lagkagehuset_ and that our database shall reflect this fact.

##### What do the `false` and the `true` mean?

```javascript
> db.users.update
function (query, obj, upsert, multi) {
...
}
```

In [28]:
db.users.update( { "favorites.cafe": "Café Retro" }, 
                 { $addToSet: { "favorites.cafe": "Lagkagehuset" } }, 
                 false, true)

WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
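
The set semantics of `$addToSet` can be illustrated with a plain JavaScript array. The helper `addToSet` below is a hypothetical stand-in for the operator and assumes simple string values.

```javascript
// Sketch of $addToSet on one array: append the value only if it is missing.
function addToSet(array, value) {
  return array.includes(value) ? array : [...array, value];
}

const cafes = ['Paludan Bog & Café', 'Café Retro', 'Conditori La Glace'];

const once = addToSet(cafes, 'Lagkagehuset');
const twice = addToSet(once, 'Lagkagehuset'); // no duplicate is added

console.log(once.length);  // 4
console.log(twice.length); // 4
```

Running the same update a second time therefore leaves the document unchanged, which is what distinguishes `$addToSet` from a plain append.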

## Deleting Documents

### Delete Documents Matching a Query

In [29]:
db.users.remove({"favorites.cafe": "Café Retro"})

WriteResult({ "nRemoved" : 1 })

### Delete all Documents of a Collection

In [30]:
db.users.remove({})

WriteResult({ "nRemoved" : 2 })

In [31]:
db.users.find({})

### Deleting an Entire Collection

In [32]:
db.users.drop()

true

## Getting `help`

In [None]:
help

In [None]:
db.users.help()

# Querying with Indexes

Let's start with creating a large collection of numbers.

You know JavaScript. How do I specify a loop from `0` to `20000` in JavaScript?

In [33]:
for(i = 0; i < 20000; i++) { 
    db.numbers.save( { num: i } ); 
}

WriteResult({ "nInserted" : 1 })

In [34]:
db.numbers.count()

20000

In [35]:
db.numbers.find( {} ).limit( 5 )

{ "_id" : ObjectId("5a7b32440b1e5fccffb6e495"), "num" : 0 }
{ "_id" : ObjectId("5a7b32440b1e5fccffb6e496"), "num" : 1 }
{ "_id" : ObjectId("5a7b32440b1e5fccffb6e497"), "num" : 2 }
{ "_id" : ObjectId("5a7b32440b1e5fccffb6e498"), "num" : 3 }
{ "_id" : ObjectId("5a7b32440b1e5fccffb6e499"), "num" : 4 }

In [36]:
db.numbers.find( { num: { "$gt": 20, "$lt": 25 } } )

{ "_id" : ObjectId("5a7b32440b1e5fccffb6e4aa"), "num" : 21 }
{ "_id" : ObjectId("5a7b32440b1e5fccffb6e4ab"), "num" : 22 }
{ "_id" : ObjectId("5a7b32440b1e5fccffb6e4ac"), "num" : 23 }
{ "_id" : ObjectId("5a7b32440b1e5fccffb6e4ad"), "num" : 24 }

## Execution Statistics

When any database receives a query, it must plan out how to execute it. This is called a _query plan_.

The `explain` method describes the query plan and allows developers to diagnose slow operations by showing which indexes a query has used.

In [37]:
db.numbers.find( { 
                   num: { 
                     "$gt": 19995 
                   } 
                 } ).explain("executionStats")

{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "test.numbers",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"num" : {
				"$gt" : 19995
			}
		},
		"winningPlan" : {
			"stage" : "COLLSCAN",
			"filter" : {
				"num" : {
					"$gt" : 19995
				}
			},
			"direction" : "forward"
		},
		"rejectedPlans" : [ ]
	},
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 4,
		"executionTimeMillis" : 7,
		"totalKeysExamined" : 0,
		"totalDocsExamined" : 20000,
		"executionStages" : {
			"stage" : "COLLSCAN",
			"filter" : {
				"num" : {
					"$gt" : 19995
				}
			},
			"nReturned" : 4,
			"executionTimeMillisEstimate" : 0,
			"works" : 20002,
			"advanced" : 4,
			"needTime" : 19997,
			"needYield" : 0,
			"saveState" : 156,
			"restoreState" : 156,
			"isEOF" : 1,
			"invalidates" : 0,
			"direction" : "forward",
			"docsExamined" : 20000
		}
	},
	"serverInfo" : {
		"host" : "88e2572f678b",
		"port" : 27017,

## Creating an Index in MongoDB

In addition to the indexes that you create manually, MongoDB automatically creates an index on the `_id` field of every collection.

MongoDB indexes use a _B-tree_ data structure. Jens will tell you more about this data structure.
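
Why an index helps with a query such as `num > 19995` can be sketched without a database. The example below compares a full scan with a lookup in a sorted array of keys. The sorted array with binary search is a deliberately simplified stand-in for a B-tree, not MongoDB's actual index structure, and all function names are hypothetical.

```javascript
// 20000 documents, as created by the loop in this lecture.
const numbers = Array.from({ length: 20000 }, (_, i) => ({ num: i }));

// Collection scan: every document is examined.
function collectionScan(docs, lowerBound) {
  let examined = 0;
  const result = [];
  for (const doc of docs) {
    examined++;
    if (doc.num > lowerBound) result.push(doc);
  }
  return { result, examined };
}

// Stand-in for an index: the keys in sorted order.
// A binary search finds the first key greater than lowerBound,
// after which only matching keys are read.
function indexScan(sortedKeys, lowerBound) {
  let lo = 0;
  let hi = sortedKeys.length;
  let searchSteps = 0;
  while (lo < hi) {
    searchSteps++;
    const mid = (lo + hi) >> 1;
    if (sortedKeys[mid] > lowerBound) hi = mid;
    else lo = mid + 1;
  }
  return { result: sortedKeys.slice(lo), searchSteps };
}

const sortedKeys = numbers.map((doc) => doc.num); // already sorted here

const scan = collectionScan(numbers, 19995);
const indexed = indexScan(sortedKeys, 19995);

console.log(scan.examined, scan.result.length);          // 20000 4
console.log(indexed.searchSteps, indexed.result.length); // about 15 steps, 4 results
```

The scan touches all 20000 documents to return four of them, while the sorted keys allow the same four results to be located in a number of steps that grows only logarithmically with the collection size.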

In [38]:
db.numbers.createIndex( { num: 1 } )

{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"ok" : 1
}

In [39]:
db.numbers.getIndexes()

[
	{
		"v" : 2,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_",
		"ns" : "test.numbers"
	},
	{
		"v" : 2,
		"key" : {
			"num" : 1
		},
		"name" : "num_1",
		"ns" : "test.numbers"
	}
]

In [40]:
db.numbers.find( { 
                   num: { 
                     "$gt": 19995 
                   } 
                 } ).explain("executionStats")

{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "test.numbers",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"num" : {
				"$gt" : 19995
			}
		},
		"winningPlan" : {
			"stage" : "FETCH",
			"inputStage" : {
				"stage" : "IXSCAN",
				"keyPattern" : {
					"num" : 1
				},
				"indexName" : "num_1",
				"isMultiKey" : false,
				"multiKeyPaths" : {
					"num" : [ ]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"num" : [
						"(19995.0, inf.0]"
					]
				}
			}
		},
		"rejectedPlans" : [ ]
	},
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 4,
		"executionTimeMillis" : 4,
		"totalKeysExamined" : 4,
		"totalDocsExamined" : 4,
		"executionStages" : {
			"stage" : "FETCH",
			"nReturned" : 4,
			"executionTimeMillisEstimate" : 0,
			"works" : 5,
			"advanced" : 4,
			"needTime" : 0,
			"

In [41]:
for(i = 20000; i < 50000; i++) { 
    db.numbers.save( { num: i } ); 
}

WriteResult({ "nInserted" : 1 })

In [42]:
db.numbers.find( { 
                   num: { 
                     "$gt": 49995 
                   } 
                 } )

{ "_id" : ObjectId("5a7b32750b1e5fccffb7a7e1"), "num" : 49996 }
{ "_id" : ObjectId("5a7b32750b1e5fccffb7a7e2"), "num" : 49997 }
{ "_id" : ObjectId("5a7b32750b1e5fccffb7a7e3"), "num" : 49998 }
{ "_id" : ObjectId("5a7b32750b1e5fccffb7a7e4"), "num" : 49999 }

# Connect to MongoDB from a Java Maven Project

https://mongodb.github.io/mongo-java-driver/3.0/driver/getting-started/installation-guide/

```xml
<dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongodb-driver</artifactId>
    <version>3.0.4</version>
</dependency>
```

Based on https://mongodb.github.io/mongo-java-driver/3.0/driver/getting-started/quick-tour/

```java
package dk.cphbusiness.db.meassurements;

import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class MongoTest {

    public static void main(String[] args) {
        MongoClientURI connStr = new MongoClientURI("mongodb://localhost:27017");
        MongoClient mongoClient = new MongoClient(connStr);

        MongoDatabase db = mongoClient.getDatabase("test-database");
        MongoCollection<Document> collection = db.getCollection("tweets");

        Document myDoc = collection.find().first();
        System.out.println(myDoc.toJson());
    }
}
```

# Importing Data

You can either write a program that inserts documents into a database or use MongoDB's CLI import tool, `mongoimport`.

```bash
mongoimport --drop --db social_net --collection tweets --type csv --headerline --file testdata.manual.2009.06.14.csv
```


# Your turn at home!

![](http://www.twenty19.com/blog/wp-content/uploads/2017/07/typing2.gif)

## Assignment 2 - Analysis of Twitter Data

Your task is to implement a small database application, which imports a dataset of Twitter tweets from a CSV file into a database.

Your application has to be able to answer queries corresponding to the following questions:

  1. How many Twitter users are in the database?
  2. Which Twitter users link the most to other Twitter users? (Provide the top ten.)
  3. Who are the most mentioned Twitter users? (Provide the top five.)
  4. Who are the most active Twitter users? (Provide the top ten.)
  5. Who are the five most grumpy (most negative tweets) and the five most happy (most positive tweets) Twitter users? (Provide five users for each group.)

Your application can be written in the language of your choice. It must have some form of UI, but it does not matter whether it is an API, a CLI, a GUI, or a web-based UI.

You present your system's answers to the questions above in a Markdown file on your Github account. That is, you hand in this assignment via Github, with one hand-in per group. Push your solution, i.e., the source code and the presentation of the results, to one Github repository per group and post a link to your solution in the hand-in area.


## Hints

You can download and uncompress a dataset of Twitter tweets from http://help.sentiment140.com/for-students/.


Connect to your Docker container running the MongoDB DBMS.

```bash
$ docker run --rm -v $(pwd)/data:/data/db --publish=27017:27017 --name dbms -d mongo
88385afac5fe88a5ba47cd60c084bc1855cae6089a7e7d95ba24f0ba6fea1404
$ docker exec -it 88385afa bash
```

On that container, install `wget` and `unzip`. **Note**: everything from here on is a generic description of how to work on a Linux distribution with the `apt` package manager, such as Ubuntu or Debian.

```bash
root@88385afac5fe:/$ apt-get update
root@88385afac5fe:/$ apt-get install -y wget unzip
```

Continue by downloading the data:

```bash
root@88385afac5fe:/$ wget http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
```

In your VM the unzip package is not installed by default. 

Now you can uncompress the Twitter dataset to your current directory with:

```bash
root@88385afac5fe:/$ unzip trainingandtestdata.zip
```

After uncompressing, you will have a folder with two files. For your exercise you will use the bigger one, `training.1600000.processed.noemoticon.csv`. Neither of the two CSV files contains a header row. The documentation on http://help.sentiment140.com/for-students/ describes the columns as follows.

  > The data is a CSV with emoticons removed. Data file format has 6 fields:
  >
  >   * 0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
  >   * 1 - the id of the tweet (2087)
  >   * 2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
  >   * 3 - the query (lyx). If there is no query, then this value is NO_QUERY.
  >   * 4 - the user that tweeted (robotickilldozr)
  >   * 5 - the text of the tweet (Lyx is cool)

To make use of the `--headerline` switch when importing the data with mongoimport, we add a headerline accordingly:

```bash
root@88385afac5fe:/# sed -i '1s;^;polarity,id,date,query,user,text\n;' training.1600000.processed.noemoticon.csv
```

As a result, the fields of our documents are named after the given header fields.
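
How a header line turns a CSV row into a document with named fields can be sketched in a few lines of JavaScript. This is only an illustration of the idea and not how `mongoimport` is implemented: the helpers `parseCsvLine` and `toDocument` are hypothetical, the sketch assumes that fields are wrapped in double quotes, and it keeps all values as strings. The sample row is assembled from the example values in the format description above, with the polarity `4` chosen arbitrarily.

```javascript
// Minimal CSV line parser that understands double-quoted fields.
function parseCsvLine(line) {
  const fields = [];
  let current = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      fields.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Combine the header names with one data row into a document.
function toDocument(headerLine, dataLine) {
  const names = parseCsvLine(headerLine);
  const values = parseCsvLine(dataLine);
  const doc = {};
  names.forEach((name, i) => {
    doc[name] = values[i];
  });
  return doc;
}

const header = 'polarity,id,date,query,user,text';
// Illustrative row; the polarity value is an assumption for this example.
const row = '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"';

console.log(toDocument(header, row));
```

The resulting object has the keys `polarity`, `id`, `date`, `query`, `user`, and `text`, which are the field names you can then use in your queries.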
After importing the dataset, the dates are represented as strings instead of proper date objects. You might want to convert them with the following code:

```javascript
db.tweets.find().forEach(function(doc){
    if (doc.date instanceof Date !== true) {
        doc.date = new Date(doc.date);
        db.tweets.save(doc);
    }
});
```

For some of the questions, you might want to have a look into MongoDB's aggregation framework and queries using regular expressions. For example, the following query finds all tweets mentioning another Twitter user.

```
db.tweets.aggregate([
    { $match: { text: /@\w+/ } },
    { $group: { _id: null, text: { $push: "$text" } } }
])
```
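
The regular expression used in the `$match` stage can also be tried out on a few strings before running it against the whole collection. The sketch below is plain JavaScript working on made-up sample tweets; the user names and texts are hypothetical, and counting in application code is only one possible approach next to the aggregation framework.

```javascript
// Hypothetical sample tweets; the real data comes from the imported collection.
const tweets = [
  { user: 'alice', text: 'Thanks @bob and @carol for the tip!' },
  { user: 'bob', text: 'No mentions here.' },
  { user: 'carol', text: '@bob are you coming?' }
];

// Count how often each user name is mentioned.
const mentionCounts = new Map();
for (const tweet of tweets) {
  // The g flag returns every mention in the text, not only the first one.
  const mentions = tweet.text.match(/@\w+/g) || [];
  for (const mention of mentions) {
    mentionCounts.set(mention, (mentionCounts.get(mention) || 0) + 1);
  }
}

// Sort by count, highest first.
const ranking = [...mentionCounts.entries()].sort((a, b) => b[1] - a[1]);
console.log(ranking); // [ [ '@bob', 2 ], [ '@carol', 1 ] ]
```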


## Hand-in procedure

  * Provide all code and documentation for this assignment in a repository on Github.
  * Create a Markdown (.md) file called README.md in the root of your project.
  * That README.md describes what this project does and how to make it work. That is, your reviewer has to be able to clone your project, build it (you have to define the steps for how to do that and which dependencies are required), and use it.
  * A presentation of your system's reply to the **five** queries above.
  * Hand in a link to your repository on www.peergrade.io.
  * Hand in no later than 12. Feb. 23:55.
  
## Review procedure

  * Log onto www.peergrade.io with your school email address.
  * Finish your review on www.peergrade.io no later than 14. Feb. 12:00 (noon).
  * Make use of the review criteria below when giving feedback.


# Video

  * https://www.youtube.com/watch?v=1sLjWlWvCsc

# Literature

  * http://www.redbook.io/pdf/redbook-5th-edition.pdf
  * http://15721.courses.cs.cmu.edu/spring2016/papers/whatgoesaround-stonebraker.pdf
  * https://docs.mongodb.com/manual/tutorial/query-documents/
  * https://docs.mongodb.com/manual/reference/
  
  * https://docs.mongodb.com/manual/indexes/
  * https://docs.mongodb.com/manual/reference/program/mongoimport/
