# Today

  1. Aggregation pipeline
  2. Data Modeling Concepts

# Aggregation pipeline

Aggregation is used for (at least):
* Finding the sum (or average) of a selected field across all documents
* Finding the sum (or average) of a selected field across some of the documents
* Grouping the documents, and then computing the sum of all documents within each group
* Sorting within the group
* Picking the first N documents

The MongoDB `db.mycollection.aggregate()` function comes to the rescue.

# What is a mongodb collection

Cut to the bone, it is a JSON array of the form:

```json
[ 
    {_id:ID_1, propName_1: some_value, propName_2: some_value, ....},
    {_id:ID_2, propName_1: some_value, propName_17: some_value, ....},
    {_id:ID_3, propName_2: some_value, propName_1: some_value, ....},
    ...
]
```
Each **object** is called a **document**, and **must** have a property named `_id`.

Besides the array of documents, a MongoDB collection has indexes and other metadata that let MongoDB query it efficiently.

# Stages
The mongodb aggregation pipeline consists of a number of **stages**. 

Each stage takes one collection (a stream of documents) as input and produces another collection as output. For example:

* `$match`. Filters the document stream to allow only matching documents to pass unmodified into the next pipeline stage.
* `$sort`. Reorders the document stream by a specified sort key. Only the order changes; the documents remain unmodified. For each input document, outputs one document.
* `$project`. Reshapes each document in the stream, such as by adding new fields or removing existing fields. For each input document, outputs one document.
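
To make the stream idea concrete, here is a minimal pure-Python sketch of the three stages above. It is only a mental model, not how MongoDB implements them; the helper names (`match`, `sort`, `project`, `run_pipeline`) and the three sample documents are made up for illustration.

```python
from functools import reduce

# A tiny made-up "collection": a list of documents (dicts), each with an _id.
books = [
    {"_id": 1, "title": "B", "pageCount": 416},
    {"_id": 2, "title": "A", "pageCount": 592},
    {"_id": 3, "title": "C", "pageCount": 0},
]

def match(predicate):
    """Stage that lets only documents satisfying predicate pass, unmodified."""
    return lambda docs: [d for d in docs if predicate(d)]

def sort(key, direction=1):
    """Stage that reorders documents by one field (1 ascending, -1 descending)."""
    return lambda docs: sorted(docs, key=lambda d: d[key], reverse=(direction == -1))

def project(*fields):
    """Stage that keeps only the named fields of each document."""
    return lambda docs: [{f: d[f] for f in fields if f in d} for d in docs]

def run_pipeline(docs, stages):
    """Feed the output of each stage into the next one."""
    return reduce(lambda current, stage: stage(current), stages, docs)

result = run_pipeline(books, [
    match(lambda d: d["pageCount"] > 0),
    sort("title"),
    project("title"),
])
print(result)  # [{'title': 'A'}, {'title': 'B'}]
```

Each stage is just a function from a list of documents to a new list of documents, which is why stages can be chained in any order.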


# Setting up the book collection

In [2]:
%%bash
docker container ls

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                      NAMES
cc884c1600a3        mongo:latest        "docker-entrypoint.s…"   22 hours ago        Up 22 hours         0.0.0.0:27017->27017/tcp   mymongo


In [None]:
%%bash
docker run --rm -v /Users/kasper/PlayGround/Jupyter/data:/data/db --publish=27017:27017 --name mymongo -d mongo:latest

In [12]:
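from pymongo import MongoClient

# Connect to the MongoDB server on localhost:27017 (the pymongo default)
client = MongoClient()
# Use the "soft" database; MongoDB creates it lazily on the first write
db = client.soft
print("Done")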

Done


In [13]:
from urllib.request import urlopen
from bson.json_util import loads

link = "https://raw.githubusercontent.com/ozlerhakan/mongodb-json-files/master/datasets/catalog.books.json"
# Download the dataset: it holds one JSON document per line
with urlopen(link) as f:
    allBooks = f.read().decode("utf-8")

count = 0
for line in allBooks.splitlines():
    # bson's loads understands MongoDB extended JSON such as {"$date": ...}
    jsonbook = loads(line)
    db.books.insert_one(jsonbook)
    count += 1
"Read " + str(count)

'Read 431'

# $limit

In its simplest form it is like this:

In [2]:
%%bash
query='db.books.aggregate([{$limit:3} ])'
docker exec mymongo mongo soft --eval "$query"

MongoDB shell version v4.0.5
connecting to: mongodb://127.0.0.1:27017/soft?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("4d93011c-972c-4506-a3ed-8ecf37ceb474") }
MongoDB server version: 4.0.5
{ "_id" : 1, "title" : "Unlocking Android", "isbn" : "1933988673", "pageCount" : 416, "publishedDate" : ISODate("2009-04-01T07:00:00Z"), "thumbnailUrl" : "https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.book-thumb-images/ableson.jpg", "shortDescription" : "Unlocking Android: A Developer's Guide provides concise, hands-on instruction for the Android operating system and development tools. This book teaches important architectural concepts in a straightforward writing style and builds on this with practical and useful examples throughout.", "longDescription" : "Android is an open source mobile phone platform based on the Linux operating system and developed by the Open Handset Alliance, a consortium of over 30 hardware, software and telecom companies that focus on open standards f

# $project
Passes along the documents with the requested fields to the next stage in the pipeline. The specified fields can be existing fields from the input documents or newly computed fields.


In [16]:
%%bash
query='db.books.aggregate([
{$project: {title:1, pageCount:1, isbn:1, _id:0}},
{$limit:3} 
])'
docker exec mymongo mongo soft --eval "$query"

MongoDB shell version v4.0.5
connecting to: mongodb://127.0.0.1:27017/soft?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("f85ab801-8743-4d70-836c-c04c58deeb64") }
MongoDB server version: 4.0.5
{ "title" : "Unlocking Android", "isbn" : "1933988673", "pageCount" : 416 }
{ "title" : "Android in Action, Second Edition", "isbn" : "1935182722", "pageCount" : 592 }
{ "title" : "Specification by Example", "isbn" : "1617290084", "pageCount" : 0 }


### Why is the order of the properties not the same as in the query?
### Does it matter?

In [18]:
%%bash
query='db.books.aggregate([
{$limit:3}, 
{$project: {title:1, 
            noAuthors:{$size: "$authors"}, 
            isAndroid: {$gte: [{$indexOfCP: ["$title","Android"]},0]}
            }}
])'
docker exec mymongo mongo soft --eval "$query"

MongoDB shell version v4.0.5
connecting to: mongodb://127.0.0.1:27017/soft?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("36e33e2a-4c1d-4d3b-85a5-25a581a54759") }
MongoDB server version: 4.0.5
{ "_id" : 1, "title" : "Unlocking Android", "noAuthors" : 3, "isAndroid" : true }
{ "_id" : 2, "title" : "Android in Action, Second Edition", "noAuthors" : 2, "isAndroid" : true }
{ "_id" : 3, "title" : "Specification by Example", "noAuthors" : 1, "isAndroid" : false }
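
As a rough mental model of the computed fields in the last query, the sketch below mimics `$size` and `$indexOfCP` in plain Python. The function name `project_book` is made up, and the two sample documents only carry the fields the projection needs; it is an illustration of the semantics, not of MongoDB's implementation.

```python
def project_book(doc):
    """Mimic {$project: {title: 1, noAuthors: {$size: ...}, isAndroid: {...}}}."""
    return {
        "_id": doc["_id"],                 # _id is kept unless excluded with _id: 0
        "title": doc["title"],
        "noAuthors": len(doc["authors"]),  # $size: number of elements in the array
        # $indexOfCP gives -1 when the substring is absent, just like str.find
        "isAndroid": doc["title"].find("Android") >= 0,
    }

books = [
    {"_id": 1, "title": "Unlocking Android",
     "authors": ["W. Frank Ableson", "Charlie Collins", "Robi Sen"]},
    {"_id": 3, "title": "Specification by Example", "authors": ["Gojko Adzic"]},
]
projected = [project_book(b) for b in books]
print(projected[0])
# {'_id': 1, 'title': 'Unlocking Android', 'noAuthors': 3, 'isAndroid': True}
```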


### Aggregation operators

There is a large number of operators (like `$size`). The full list is at:

https://docs.mongodb.com/manual/reference/operator/aggregation/

**Warning**: this is a major source of complexity, and new operators are added in each version.

# $unwind

You specify the name of an array property.

If an input document contains a property which is an array, the `$unwind` stage outputs one document per element in the array. 

In [25]:
%%bash
query='db.books.aggregate([
{$limit:1},
{$project: {title:1, authors:1}},
{$unwind: "$authors"}
])'
docker exec mymongo mongo soft --eval "$query"

MongoDB shell version v4.0.5
connecting to: mongodb://127.0.0.1:27017/soft?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("0af37cfb-e15c-4585-9029-b9f4042b6ea2") }
MongoDB server version: 4.0.5
{ "_id" : 1, "title" : "Unlocking Android", "authors" : "W. Frank Ableson" }
{ "_id" : 1, "title" : "Unlocking Android", "authors" : "Charlie Collins" }
{ "_id" : 1, "title" : "Unlocking Android", "authors" : "Robi Sen" }
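
The effect of `$unwind` can be sketched in a few lines of plain Python. The function name `unwind` is made up for this illustration, and the sketch assumes the field is either a list, missing, or null; as with the default behaviour of `$unwind`, documents whose array is missing, null, or empty produce no output.

```python
def unwind(docs, field):
    """Yield one copy of each document per element of the array in `field`."""
    for doc in docs:
        for element in doc.get(field) or []:
            # Copy the document, replacing the array by a single element
            yield {**doc, field: element}

books = [
    {"_id": 1, "title": "Unlocking Android",
     "authors": ["W. Frank Ableson", "Charlie Collins", "Robi Sen"]},
    {"_id": 99, "title": "Made-up book without authors", "authors": []},
]
unwound = list(unwind(books, "authors"))
for doc in unwound:
    print(doc)
# {'_id': 1, 'title': 'Unlocking Android', 'authors': 'W. Frank Ableson'}
# {'_id': 1, 'title': 'Unlocking Android', 'authors': 'Charlie Collins'}
# {'_id': 1, 'title': 'Unlocking Android', 'authors': 'Robi Sen'}
```

Note that all three output documents share the same `_id`, exactly as in the shell output above.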


# $match

Filters the documents to pass only the documents that match the specified condition(s) to the next pipeline stage.


In [26]:
%%bash
query='db.books.aggregate([
{$match:{authors:[ "Gojko Adzic" ]}},
{$project: {title:1, authors:1}}
])'
docker exec mymongo mongo soft --eval "$query"

MongoDB shell version v4.0.5
connecting to: mongodb://127.0.0.1:27017/soft?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("d2e1ad38-cd36-4a26-80b4-7cb30612ca3b") }
MongoDB server version: 4.0.5
{ "_id" : 3, "title" : "Specification by Example", "authors" : [ "Gojko Adzic" ] }


In [28]:
%%bash
query='db.books.aggregate([
{$match:{authors: {$in: [ "Robi Sen"] } } },
{$project: {title:1, authors:1}}
])'
docker exec mymongo mongo soft --eval "$query"

MongoDB shell version v4.0.5
connecting to: mongodb://127.0.0.1:27017/soft?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("16433449-34a4-4959-932e-26eb510e1df3") }
MongoDB server version: 4.0.5
{ "_id" : 1, "title" : "Unlocking Android", "authors" : [ "W. Frank Ableson", "Charlie Collins", "Robi Sen" ] }
{ "_id" : 2, "title" : "Android in Action, Second Edition", "authors" : [ "W. Frank Ableson", "Robi Sen" ] }
{ "_id" : 514, "title" : "Android in Action, Third Edition", "authors" : [ "W. Frank Ableson", "Robi Sen", "Chris King", "C. Enrique Ortiz" ] }
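
The two queries above behave differently: matching on `authors: ["Gojko Adzic"]` compares against the whole array, while `$in` asks whether any element of the array is in the given list. A simplified plain-Python sketch of the difference follows; the helper names `matches_exact` and `matches_in` are made up, and some corner cases of MongoDB's array matching are deliberately left out.

```python
def matches_exact(doc, field, value):
    """{field: [...]}: the stored array must equal the given array, order included."""
    return doc.get(field) == value

def matches_in(doc, field, candidates):
    """{field: {$in: [...]}} on an array field: at least one element is a candidate."""
    return any(element in candidates for element in doc.get(field, []))

books = [
    {"_id": 1, "authors": ["W. Frank Ableson", "Charlie Collins", "Robi Sen"]},
    {"_id": 3, "authors": ["Gojko Adzic"]},
]
exact = [b["_id"] for b in books if matches_exact(b, "authors", ["Gojko Adzic"])]
member = [b["_id"] for b in books if matches_in(b, "authors", ["Robi Sen"])]
print(exact, member)  # [3] [1]
```

With the exact form, a book that Gojko Adzic co-authored with someone else would not be found; the `$in` form would find it.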


In [32]:
%%bash
query='db.books.aggregate([
{$match:{categories: {$in: [ "Java" ] } } },
{$project: {title:1, authors:1}},
{$limit:10}
])'
docker exec mymongo mongo soft --eval "$query"

MongoDB shell version v4.0.5
connecting to: mongodb://127.0.0.1:27017/soft?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("e4ca486a-5202-4fa5-9d9e-88757c24dd92") }
MongoDB server version: 4.0.5
{ "_id" : 2, "title" : "Android in Action, Second Edition", "authors" : [ "W. Frank Ableson", "Robi Sen" ] }
{ "_id" : 9, "title" : "Griffon in Action", "authors" : [ "Andres Almiray", "Danno Ferrin", "", "James Shingler" ] }
{ "_id" : 10, "title" : "OSGi in Depth", "authors" : [ "Alexandre de Castro Alves" ] }
{ "_id" : 21, "title" : "3D User Interfaces with Java 3D", "authors" : [ "Jon Barrilleaux" ] }
{ "_id" : 22, "title" : "Hibernate in Action", "authors" : [ "Christian Bauer", "Gavin King" ] }
{ "_id" : 23, "title" : "Hibernate in Action (Chinese Edition)", "authors" : [ "Christian Bauer", "Gavin King" ] }
{ "_id" : 24, "title" : "Java Persistence with Hibernate", "authors" : [ "Christian Bauer", "Gavin King" ] }
{ "_id" : 28, "title" : "Hibernate Search in Action", "aut

# Your turn

In pairs or groups of three.

1. Get the books collection into your mongo database.
    - The data is at: https://raw.githubusercontent.com/ozlerhakan/mongodb-json-files/master/datasets/catalog.books.json
    - either import it with `mongoimport`, or use the little Python script from last time
2. Construct some queries which:
    - list the title and authors of all Java books with at least three authors
    - list the title and authors of all Java books with Android in the title
    - use the `$sort` stage to list the title and authors of the books with the most authors (the top 10)

In [42]:
%%bash
query='db.books.aggregate([
{$match:{categories: {$in: [ "Java" ] } } },
{$project: {title:1, authors:1,
            noAuthorsLargerThanThree:{ $gte: [{$size: "$authors"},3]},
            isAndroid: {$gte: [{$indexOfCP: ["$title","Android"]},0]}
            }},
{$match: {noAuthorsLargerThanThree:true, isAndroid:true}},
{$project: {title:1, authors:1}},
{$limit:3}
])'
docker exec mymongo mongo soft --eval "$query"

MongoDB shell version v4.0.5
connecting to: mongodb://127.0.0.1:27017/soft?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("8db29d32-7aa4-4071-a3a1-8e3c057d590a") }
MongoDB server version: 4.0.5


# $group

Groups documents by some specified expression and outputs to the next stage a document for each distinct grouping. 

The output documents contain an `_id` field which contains the distinct group by key. 

The output documents can also contain _computed_ fields that hold the values of some **accumulator expression** grouped by the `$group`’s `_id` field. 

`$group` does not order its output documents.

In [63]:
%%bash
query='db.books.aggregate([
{$project: {title:1, noAuthors:{$size: "$authors"} } },
{$group : {_id:"$noAuthors", number_of_books: {$sum:1} } },
{$sort: {"_id":1}}
])'
docker exec mymongo mongo soft --eval "$query"

MongoDB shell version v4.0.5
connecting to: mongodb://127.0.0.1:27017/soft?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("e8b717ea-9a14-4299-a899-68402fd181ef") }
MongoDB server version: 4.0.5
{ "_id" : 0, "number_of_books" : 37 }
{ "_id" : 1, "number_of_books" : 206 }
{ "_id" : 2, "number_of_books" : 105 }
{ "_id" : 3, "number_of_books" : 9 }
{ "_id" : 4, "number_of_books" : 47 }
{ "_id" : 5, "number_of_books" : 16 }
{ "_id" : 6, "number_of_books" : 6 }
{ "_id" : 7, "number_of_books" : 2 }
{ "_id" : 8, "number_of_books" : 3 }
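
In plain Python, the `$project` + `$group` + `$sort` combination above amounts to counting documents per key. The sketch below uses `collections.Counter`; the function name `group_count` and the four sample books are made up for illustration.

```python
from collections import Counter

def group_count(docs, key):
    """Count documents per group key, like {$group: {_id: <key>, n: {$sum: 1}}}."""
    counts = Counter(key(d) for d in docs)
    # $group itself gives no ordering; sorted() plays the role of the $sort stage
    return [{"_id": k, "number_of_books": n} for k, n in sorted(counts.items())]

books = [
    {"title": "A", "authors": ["x"]},
    {"title": "B", "authors": ["x", "y"]},
    {"title": "C", "authors": ["z"]},
    {"title": "D", "authors": []},
]
grouped = group_count(books, key=lambda d: len(d["authors"]))
print(grouped)
# [{'_id': 0, 'number_of_books': 1}, {'_id': 1, 'number_of_books': 2}, {'_id': 2, 'number_of_books': 1}]
```

`{$sum: 1}` adds 1 for every document that falls into the group, which is why it works as a counter.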


### Renaming the _id field

In [67]:
%%bash
query='db.books.aggregate([
{$project: {title:1, noAuthors:{$size: "$authors"} } },
{$group : {_id:"$noAuthors", number_of_books: {$sum:1} } },
{$project: {_id:0,number_of_authors: "$_id", number_of_books:"$number_of_books"}},
{$sort: {"number_of_authors":1}}
])'
docker exec mymongo mongo soft --eval "$query"

MongoDB shell version v4.0.5
connecting to: mongodb://127.0.0.1:27017/soft?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("f858b0c4-3ff6-415b-99e1-50d6bc67ef18") }
MongoDB server version: 4.0.5
{ "number_of_authors" : 0, "number_of_books" : 37 }
{ "number_of_authors" : 1, "number_of_books" : 206 }
{ "number_of_authors" : 2, "number_of_books" : 105 }
{ "number_of_authors" : 3, "number_of_books" : 9 }
{ "number_of_authors" : 4, "number_of_books" : 47 }
{ "number_of_authors" : 5, "number_of_books" : 16 }
{ "number_of_authors" : 6, "number_of_books" : 6 }
{ "number_of_authors" : 7, "number_of_books" : 2 }
{ "number_of_authors" : 8, "number_of_books" : 3 }


### Counting number of books in each category

In [55]:
%%bash
query='db.books.aggregate([
{$unwind:"$categories"},
{$group: {_id:"$categories", noOfBooksInCategory: {$sum:1}}},
{$sort: {"noOfBooksInCategory":-1}},
{$limit: 10}
])'
docker exec mymongo mongo soft --eval "$query"

MongoDB shell version v4.0.5
connecting to: mongodb://127.0.0.1:27017/soft?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("90e046af-1baa-42e8-8f45-1923f57d08e9") }
MongoDB server version: 4.0.5
{ "_id" : "Java", "noOfBooksInCategory" : 96 }
{ "_id" : "Internet", "noOfBooksInCategory" : 41 }
{ "_id" : "Microsoft .NET", "noOfBooksInCategory" : 34 }
{ "_id" : "Web Development", "noOfBooksInCategory" : 17 }
{ "_id" : "Software Engineering", "noOfBooksInCategory" : 16 }
{ "_id" : "Business", "noOfBooksInCategory" : 12 }
{ "_id" : "Programming", "noOfBooksInCategory" : 12 }
{ "_id" : "Client-Server", "noOfBooksInCategory" : 11 }
{ "_id" : "Microsoft", "noOfBooksInCategory" : 8 }
{ "_id" : "Theory", "noOfBooksInCategory" : 7 }
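
The same pipeline can also be written as a Python list of dicts and handed to pymongo instead of going through `docker exec`. The sketch below assumes the `soft` database and `books` collection from the setup cells; the function names `top_categories_pipeline` and `top_categories` are made up. Only the second function needs a running server.

```python
def top_categories_pipeline(limit=10):
    """Build the pipeline: in Python the stage names become quoted dict keys."""
    return [
        {"$unwind": "$categories"},
        {"$group": {"_id": "$categories", "noOfBooksInCategory": {"$sum": 1}}},
        {"$sort": {"noOfBooksInCategory": -1}},
        {"$limit": limit},
    ]

def top_categories(db, limit=10):
    """Run the pipeline; `db` is a pymongo Database such as MongoClient().soft."""
    # aggregate() returns a cursor; list() pulls all result documents
    return list(db.books.aggregate(top_categories_pipeline(limit)))

pipeline = top_categories_pipeline(5)
print([next(iter(stage)) for stage in pipeline])
# ['$unwind', '$group', '$sort', '$limit']
```

Keeping the pipeline in a function makes it easy to vary parameters such as the limit without editing a query string.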


# Data Modeling Concepts

  > Database schema design is the process of choosing the best representation for a data set, given the features of the database system, the nature of the data, and the application requirements.
  >
  > _MongoDB in Action_

  > The key challenge in data modeling is balancing the needs of the application, the performance characteristics of the database engine, and the data retrieval patterns. When designing data models, always consider the application usage of the data (i.e. queries, updates, and processing of the data) as well as the inherent structure of the data itself.
  >
  > https://docs.mongodb.com/manual/core/data-modeling-introduction/
  


## Embedded Data Models - Embedded Documents

![](https://docs.mongodb.com/manual/_images/data-model-denormalized.bakedsvg.svg)

Use embedded data models for:

  * modeling _contains_ relationships between entities.
  * modeling _one-to-many_ relationships between entities.

## Normalized Data Models - References

![](https://docs.mongodb.com/manual/_images/data-model-normalized.bakedsvg.svg)

Use normalized data models:

  * when _embedding_ would result in duplication of data but would not provide sufficient read performance advantages to outweigh the implications of the duplication.
  * to model large _hierarchical data sets_

https://docs.mongodb.com/manual/core/data-model-design/
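
The two styles can be compared on the books data used earlier. In the sketch below the embedded form is how the books collection already stores authors; the normalized form, the `author_ids` field, and the author `_id` values are invented for illustration.

```python
# Embedded: author names live inside the book document, as in the books collection.
book_embedded = {
    "_id": 2,
    "title": "Android in Action, Second Edition",
    "authors": ["W. Frank Ableson", "Robi Sen"],
}

# Normalized: authors are their own documents; the book stores only references.
authors = {
    "a1": {"_id": "a1", "name": "W. Frank Ableson"},
    "a2": {"_id": "a2", "name": "Robi Sen"},
}
book_normalized = {
    "_id": 2,
    "title": "Android in Action, Second Edition",
    "author_ids": ["a1", "a2"],
}

def resolve_authors(book):
    """Follow the references: an extra lookup step that embedding avoids."""
    return [authors[author_id]["name"] for author_id in book["author_ids"]]

print(resolve_authors(book_normalized))  # ['W. Frank Ableson', 'Robi Sen']
```

With embedding, one read returns everything, but an author's name is repeated in every book. With references, the name is stored once, at the cost of the extra lookup.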

# Kasper's sheep farm

![](http://maarumlam.dk/____impro/1/onewebmedia/Bagside2.JPG?etag=W%2F%226855b-5760176c%22&sourceContentType=image%2Fjpeg&ignoreAspectRatio&resize=868%2B829&quality=85)

# Relational data model for sheep

![](images/Sheeps.png)

#### How do we model the relations in mongo?
#### What are our options?

### What to consider when modelling data

  * Atomicity of Write Operations
  * Document Growth
  * Data Use and Performance
  * Collection Growth 
  * Indexes
  * Collection Contains Large Number of Small Documents
  * Data Lifecycle Management
  * Sharding

https://docs.mongodb.com/manual/core/data-model-operations/

## Sharding?

![](https://docs.mongodb.com/manual/_images/sharding-range-based.bakedsvg.svg)
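
The figure shows range-based sharding: the value of the shard key decides which chunk a document belongs to. A minimal sketch of that routing idea follows, with made-up split points and a made-up function name `chunk_for`; real chunk management in MongoDB (splitting, balancing, routing) is far more involved.

```python
from bisect import bisect_right

# Hypothetical split points on the shard key:
#   chunk 0 holds keys < 100, chunk 1 holds 100 <= key < 200, chunk 2 holds keys >= 200
SPLIT_POINTS = [100, 200]

def chunk_for(shard_key):
    """Return the index of the chunk whose key range contains shard_key."""
    return bisect_right(SPLIT_POINTS, shard_key)

for key in (5, 100, 199, 250):
    print(key, "-> chunk", chunk_for(key))
# 5 -> chunk 0
# 100 -> chunk 1
# 199 -> chunk 1
# 250 -> chunk 2
```

Documents with nearby key values end up in the same chunk, which helps range queries but can concentrate writes on one chunk if the key grows monotonically.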

# Handin


See exercises at:

https://github.com/datsoftlyngby/soft2019spring-databases/blob/master/assignments/assignment3.md
