# GDB LE5: Reading non-relational databases

## Task
You should choose a dataset that fits **better** in a NoSQL database (than in a relational one) and engage with both the database and the data.

So that we can assess your approach and your learning progress, we need a short summary from you.

This summary should answer the following questions:

1. What does the data look like?
2. Which database did you choose?
3. Why did you choose it?
4. In what way is it better than a relational database?
5. What do complex questions (queries) about the data look like, and why are they complex? Compare them with SQL: would they be more complicated there, and what advantages are there over SQL?

### NoSQL database decision

MongoDB, because...

### Dataset decision

Analytics, because...

In [145]:
from pymongo import MongoClient
import urllib
import pandas
import configparser
import pprint

config = configparser.ConfigParser()
config.read('config.ini')

db_username = config["MongoDB"]["username"]
db_password = config["MongoDB"]["password"]
db_hostname = config["MongoDB"]["hostname"]
db_link = config["MongoDB"]["link"]

jsonp = pprint.pprint

In [146]:
!curl ipecho.net/plain

147.86.207.8


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    12  100    12    0     0     88      0 --:--:-- --:--:-- --:--:--    88


Connect to MongoDB Atlas.

In [147]:
client = MongoClient('mongodb+srv://{}:{}@{}.{}.mongodb.net/sample_analytics'.format(db_username, db_password, db_hostname, db_link))
db = client.sample_analytics

mongodb+srv://dbuser:dbuserpw@cluster0.gshpm.mongodb.net/sample_analytics
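The connection string above is built with plain string formatting. If the username or password contains characters such as `@`, `:` or `/`, the URI becomes ambiguous, so such values are usually percent-escaped first. The following is a minimal sketch of that step; the helper name `build_atlas_uri` and the sample values are hypothetical and not part of the notebook.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username, password, hostname, link, database):
    """Build a mongodb+srv URI with percent-escaped credentials."""
    return "mongodb+srv://{}:{}@{}.{}.mongodb.net/{}".format(
        quote_plus(username), quote_plus(password), hostname, link, database
    )


# Hypothetical values for illustration only, not real credentials
example_uri = build_atlas_uri("db user", "p@ss:w/rd", "cluster0", "abcde", "sample_analytics")
print(example_uri)
```

The resulting string can be passed to `MongoClient` in the same way as the formatted string in the cell above.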


Let's list which collections are in this database

In [90]:
db.list_collection_names()

['customers', 'transactions', 'accounts']

Let's get the accounts collection.

In [92]:
accounts = db.accounts

Let's get the data from it.

In [50]:
result = accounts.find()

Interestingly, `find()` does not return the documents themselves but a cursor over them: https://docs.mongodb.com/manual/reference/method/db.collection.find/index.html

In [51]:
type(result)

pymongo.cursor.Cursor

`dir(result)` shows that the cursor has a `next` method, which can be used to step through the results, so let's iterate. Be aware that after iterating, the cursor is exhausted; save the documents in a list if you need them again.

In [52]:
for i1,res in enumerate(result):
  if i1%100==0:
    print(res)
print(i1)

{'_id': ObjectId('5ca4bbc7a2dd94ee5816238c'), 'account_id': 371138, 'limit': 9000, 'products': ['Derivatives', 'InvestmentStock']}
{'_id': ObjectId('5ca4bbc7a2dd94ee581623f2'), 'account_id': 168924, 'limit': 10000, 'products': ['InvestmentFund', 'CurrencyService', 'InvestmentStock']}
{'_id': ObjectId('5ca4bbc7a2dd94ee58162457'), 'account_id': 951849, 'limit': 10000, 'products': ['Brokerage', 'InvestmentStock']}
{'_id': ObjectId('5ca4bbc7a2dd94ee581624bb'), 'account_id': 66698, 'limit': 10000, 'products': ['InvestmentStock', 'Commodity', 'Derivatives', 'CurrencyService']}
{'_id': ObjectId('5ca4bbc7a2dd94ee5816251f'), 'account_id': 136139, 'limit': 10000, 'products': ['CurrencyService', 'InvestmentFund', 'InvestmentStock']}
{'_id': ObjectId('5ca4bbc7a2dd94ee58162583'), 'account_id': 785218, 'limit': 10000, 'products': ['Commodity', 'InvestmentStock']}
{'_id': ObjectId('5ca4bbc7a2dd94ee581625e7'), 'account_id': 145588, 'limit': 10000, 'products': ['Derivatives', 'InvestmentFund', 'Investm
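To illustrate the point about exhausted cursors, here is a small sketch that collects every n-th document into a list so the sample can be reused. It works on any iterable; the one-shot iterator `fake_cursor` with made-up documents is only a stand-in for `accounts.find()`, and the helper name `every_nth` is hypothetical.

```python
def every_nth(documents, step=100):
    """Return every `step`-th document from any iterable, e.g. a pymongo Cursor."""
    return [doc for index, doc in enumerate(documents) if index % step == 0]


# Stand-in for accounts.find(): a one-shot iterator over made-up documents
fake_cursor = iter({"account_id": n} for n in range(250))

sample = every_nth(fake_cursor)
sampled_ids = [doc["account_id"] for doc in sample]
leftover = list(fake_cursor)  # nothing is left once the iterator has been consumed

print(sampled_ids)
print(leftover)
```

With a real cursor, calling `accounts.find()` again (or `result.rewind()`) gives a fresh pass over the data.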

Now let's use MongoDB to acquire the data we need. Here we aggregate two collections, accounts and transactions. The `$lookup` stage matches the localField account_id from accounts with the foreignField account_id from transactions, similar to a JOIN ... ON in SQL, and stores the matching documents in a new array field named transactions_link. We also select only those records whose account_id is less than 370000.

In [80]:
result = accounts.aggregate([
    {
        '$lookup': {
            'from': 'transactions', 
            'localField': 'account_id', 
            'foreignField': 'account_id', 
            'as': 'transactions_link'
        }
    }, {
        '$match': {
            '$expr': {
                '$gt': [
                    370000, '$account_id'
                ]
            }
        }
    }
])

Let's check the result.

In [81]:
for res in result:
  # Print the account_id and the number of transactions in its first linked bucket
  print(res["account_id"],len(res["transactions_link"][0]["transactions"]))

198100 66
278603 83
328304 62
260499 94
135185 47
299072 46
137994 65
212024 32
353465 74
324287 21
276528 47
209363 56
136137 51
304914 65
358213 89
228290 66
55958 10
260799 92
87389 23
236908 95
240640 97
330318 72
226398 46
161714 35
344885 27
278497 79
53124 41
168924 53
299100 52
166084 73
126444 76
165706 10
328627 85
332179 99
199559 21
76399 53
159243 80
275355 99
329562 99
160912 68
101383 28
261796 54
300405 35
59715 93
261248 33
118127 43
255695 17
98267 36
139582 93
149440 8
296866 96
316726 25
278866 100
78388 28
330961 27
126668 58
161460 18
347313 54
199962 26
312052 27
175894 96
139687 91
120556 75
357510 24
348352 33
126833 27
155475 8
293516 5
116508 81
292314 8
202669 61
226865 53
170945 66
130514 27
346408 37
264514 95
356905 88
323636 27
116390 2
337979 35
325377 23
304450 81
156715 98
165436 27
244662 67
120472 88
176639 22
141597 3
103536 48
364643 29
155111 40
199711 69
130717 34
59768 98
55104 78
54368 32
86702 20
62872 54
327942 55
358133 13
54977 15
162007 3
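The pipeline above joins first and filters afterwards, and the loop only looks at the first linked bucket. As a hedged variant, the sketch below filters before the join and sums `transaction_count` over all linked buckets. It assumes that `transaction_count` equals the length of each bucket's `transactions` array; the function name `build_join_pipeline` and the output field `n_transactions` are hypothetical. The server may already reorder the stages on its own, so the main benefit is readability.

```python
def build_join_pipeline(max_account_id):
    """Aggregation pipeline: filter accounts first, then join their transaction buckets."""
    return [
        # Filter before the join so that $lookup only runs for the accounts we keep
        {"$match": {"account_id": {"$lt": max_account_id}}},
        {
            "$lookup": {
                "from": "transactions",
                "localField": "account_id",
                "foreignField": "account_id",
                "as": "transactions_link",
            }
        },
        # Sum transaction_count over all joined buckets, not only the first one
        {
            "$project": {
                "_id": 0,
                "account_id": 1,
                "n_transactions": {"$sum": "$transactions_link.transaction_count"},
            }
        },
    ]


pipeline = build_join_pipeline(370000)
# Usage with the collection from the cells above: accounts.aggregate(pipeline)
```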

Same pipeline, but this time we count the results.

In [55]:
result = accounts.aggregate([
    {
        '$lookup': {
            'from': 'transactions', 
            'localField': 'account_id', 
            'foreignField': 'account_id', 
            'as': 'transactions_link'
        }
    }, {
        '$match': {
            '$expr': {
                '$gt': [
                    370000, '$account_id'
                ]
            }
        }
    }, {
        '$count': 'account_id'
    }
])

In [56]:
for res in result:
  print(res)

{'account_id': 588}
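Because `$lookup` behaves like a left outer join and emits exactly one output document per account, the join does not change the count. A simpler way to get the same number is therefore to count on the accounts collection directly. This is a minimal sketch; the wrapper name `count_accounts_below` is hypothetical, while `count_documents` is the pymongo collection method.

```python
def count_accounts_below(collection, threshold):
    """Count documents whose account_id is below `threshold`, without a join."""
    return collection.count_documents({"account_id": {"$lt": threshold}})


# Usage with the collection from the cells above:
# count_accounts_below(accounts, 370000)
```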


If you want, check the results by loading transactions and accounts into two separate pandas DataFrames and merging them, as you would with a join in SQL.
For the NoSQL report, the MongoDB query is required. Using the `find` syntax is enough. You can still double-check the result with the pandas approach.

**Further questions:**

Can you select which account_id values have transactions with fewer than 30 elements?
Which account has the oldest transaction?
Think of 1-2 more complex questions about this dataset.
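One possible way to approach the first two questions is sketched below using `find` and `find_one`. It rests on two assumptions that are not verified here: that `transaction_count` equals the number of entries in `transactions`, and that `bucket_start_date` is the date of the earliest transaction in a bucket. The function names are hypothetical.

```python
def accounts_with_few_transactions(transactions, limit=30):
    """Return the sorted account_ids of buckets with fewer than `limit` transactions."""
    cursor = transactions.find(
        {"transaction_count": {"$lt": limit}},  # filter
        {"_id": 0, "account_id": 1},            # projection: only account_id
    )
    return sorted({doc["account_id"] for doc in cursor})


def oldest_bucket(transactions):
    """Return the bucket document with the earliest bucket_start_date."""
    return transactions.find_one(sort=[("bucket_start_date", 1)])


# Usage with the database handle from the cells above:
# accounts_with_few_transactions(db.transactions)
# oldest_bucket(db.transactions)["account_id"]
```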

### Task 1 - What does the data look like?

To analyze our data well, we first have to understand it.

In [60]:
cursor = db.accounts

In [149]:
# Print the attributes (field names) of the documents in the "accounts" collection
for document in cursor.find():
    for i in document:
        print(i)

_id
account_id
limit
products

Attributes:
- _id
- account_id
- limit
- products
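Printing every key of every document produces very long output. A more compact way to inspect the structure is to count how often each field name occurs, which also reveals fields that are only present in some documents. The sketch below works on any iterable of dictionaries; the sample documents are made up and stand in for `cursor.find()`, and the helper name `field_counts` is hypothetical.

```python
from collections import Counter


def field_counts(documents):
    """Count in how many documents each top-level field name occurs."""
    counts = Counter()
    for document in documents:
        counts.update(document.keys())
    return counts


# Made-up documents standing in for cursor.find()
docs = [
    {"_id": 1, "account_id": 10, "limit": 9000, "products": ["Brokerage"]},
    {"_id": 2, "account_id": 11, "limit": 10000, "products": []},
]
summary = dict(field_counts(docs))
print(summary)
```

The same helper can be applied to the other collections, for example `field_counts(db.customers.find())`.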

### Questions:

1. How many customers have a limit below 10000?
2. How many customers have "Investment Stock"?
3. Which products does customer '446747' have?
4. Which product is the most common?

In [53]:
# Question 1

count = 0
for i in cursor.find({"limit": {"$lt": 10000}}):
    count += 1

print("How many customers have a limit below 10000?\n- {}".format(count))

How many customers have a limit below 10000?
- 45


In [54]:
# Question 2

count = 0
for res in cursor.find({"products": "InvestmentStock"}):
    count += 1

print("How many customers have 'Investment Stock'?\n- {}".format(count))

How many customers have 'Investment Stock'?
- 1746


In [78]:
# Question 3

products = []
for i in cursor.find({"account_id": 446747}):
    for x in i["products"]:
        products.append(x)

print("Which products does customer '446747' have?")
for i in products:
    print(i)

Which products does customer '446747' have?
Derivatives
Commodity
Brokerage
InvestmentStock
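The question about the most common product is not answered by a cell above. One way to approach it is sketched below: an aggregation pipeline that unwinds the `products` array and counts each value, together with a plain-Python equivalent for checking the result on documents that are already loaded. The function names and the sample documents are hypothetical.

```python
from collections import Counter


def most_common_product_pipeline():
    """Pipeline: one document per product entry, grouped and sorted by frequency."""
    return [
        {"$unwind": "$products"},
        {"$sortByCount": "$products"},
        {"$limit": 1},
    ]


def most_common_product(documents):
    """Plain-Python check: return (product, count) for the most frequent product."""
    counts = Counter(p for doc in documents for p in doc["products"])
    return counts.most_common(1)[0]


# Made-up documents standing in for db.accounts.find()
docs = [
    {"products": ["Derivatives", "InvestmentStock"]},
    {"products": ["InvestmentStock", "Commodity"]},
]
top = most_common_product(docs)
print(top)
# Usage against the database: list(db.accounts.aggregate(most_common_product_pipeline()))
```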


### Questions about customers:

1. What is the average age of the customers?
2. ...
3. ...
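For the average-age question, a minimal sketch in plain Python is shown below. It assumes that `birthdate` is stored as a BSON date, which pymongo returns as `datetime.datetime`; this is not verified here. The function names and the sample customers are hypothetical.

```python
from datetime import datetime


def age_in_years(birthdate, today):
    """Number of full years between `birthdate` and `today`."""
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    return today.year - birthdate.year - (0 if had_birthday else 1)


def average_age(customers, today):
    """Average age over an iterable of documents that have a `birthdate` field."""
    ages = [age_in_years(customer["birthdate"], today) for customer in customers]
    return sum(ages) / len(ages)


# Made-up customers standing in for db.customers.find({}, {"birthdate": 1})
sample_customers = [
    {"birthdate": datetime(1990, 6, 15)},
    {"birthdate": datetime(2000, 1, 1)},
]
result_age = average_age(sample_customers, today=datetime(2020, 6, 14))
print(result_age)
```

Passing `today` explicitly keeps the result reproducible; in the notebook, `datetime.now()` could be used instead.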

In [150]:
cursor = db.customers

In [151]:
# Print the attributes (field names) of the documents in the "customers" collection
for document in cursor.find():
    for i in document:
        print(i)

_id
username
name
address
birthdate
email
active
accounts
tier_and_details
_id
username
name
address
birthdate
email
accounts
tier_and_details

Attributes:
- _id
- username
- name
- address
- birthdate
- email
- active (not present in every document)
- accounts
- tier_and_details

In [153]:
cursor = db.transactions

In [154]:
for document in cursor.find():
    for i in document:
        print(i)

_id
account_id
transaction_count
bucket_start_date
bucket_end_date
transactions

Attributes:

- _id
- account_id
- transaction_count
- bucket_start_date
- bucket_end_date
- transactions