# GDB LE5: Reading Non-Relational Databases

## Task
You should choose a dataset that fits **better** in a NoSQL database than in a relational one, and engage with both the database and the data.

So that we can assess your approach and learning progress, we need a short summary from you.

The summary should answer the following questions:

- What do the data look like?
- Which database did you choose?
- Why did you choose it?
- In what way is it better than a relational database?
- What do complex questions (queries) on the data look like, and why are they complex? Compare with SQL: would the query be more complicated there, and what advantages are there over SQL?

### NoSQL Database Decision

MongoDB, because...

### Dataset Decision

Analytics, because...

In [1]:
from pymongo import MongoClient
import urllib
import pandas
import configparser
import pprint

db_username = 'dbuser'
db_password = 'dbuserpw'
db_hostname = 'cluster0'

Connect to the MongoDB Atlas cluster.

In [14]:
client = MongoClient('mongodb+srv://{}:{}@{}.gshpm.mongodb.net/sample_analytics'.format(db_username, db_password, db_hostname))
db = client.sample_analytics
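
If the username or password contains reserved characters such as `@`, `/` or `:`, the connection string above becomes ambiguous. A minimal sketch, assuming a hypothetical helper `build_atlas_uri` (not part of PyMongo), percent-escapes both values with `urllib.parse.quote_plus` before formatting the URI. The host suffix `gshpm.mongodb.net` and the database name are taken from the cell above; the credentials in the example are made up.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username, password, hostname, database="sample_analytics"):
    """Return a mongodb+srv URI with percent-escaped credentials.

    Hypothetical helper for illustration; only quote_plus is a library call.
    """
    return "mongodb+srv://{}:{}@{}.gshpm.mongodb.net/{}".format(
        quote_plus(username), quote_plus(password), hostname, database
    )


# Made-up password containing reserved characters
uri = build_atlas_uri("dbuser", "p@ss/w:rd", "cluster0")
print(uri)
```

The resulting string could be passed to `MongoClient(...)` in place of the formatted literal.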

The following code prints the type of the `db` object and lists its attributes and methods.

In [15]:
print(type(db), dir(db))

<class 'pymongo.database.Database'> ['_BaseObject__codec_options', '_BaseObject__read_concern', '_BaseObject__read_preference', '_BaseObject__write_concern', '_Database__client', '_Database__incoming_copying_manipulators', '_Database__incoming_manipulators', '_Database__name', '_Database__outgoing_copying_manipulators', '_Database__outgoing_manipulators', '__call__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_apply_incoming_copying_manipulators', '_apply_incoming_manipulators', '_command', '_create_or_update_user', '_current_op', '_default_role', '_fix_incoming', '_fix_outgoing', '_list_collections', '_read_preference_for', '_retryable

Let's list the collections in this database.

In [16]:
db.list_collection_names()

['customers', 'transactions', 'accounts']

Check the status of the server and its connections.

In [17]:
client.admin.command("serverStatus")

{'host': 'cluster0-shard-00-02.gshpm.mongodb.net:27017',
 'version': '4.4.10',
 'process': 'mongod',
 'pid': 268508,
 'uptime': 575519.0,
 'uptimeMillis': 575518639,
 'uptimeEstimate': 575518,
 'localTime': datetime.datetime(2021, 10, 26, 8, 55, 4, 585000),
 'connections': {'current': 4, 'available': 496, 'totalCreated': 57},
 'extra_info': {'note': 'fields vary by platform', 'page_faults': 0},
 'network': {'bytesIn': 319336551, 'bytesOut': 21486707, 'numRequests': 1291},
 'opcounters': {'insert': 424057,
  'query': 12,
  'update': 0,
  'delete': 0,
  'getmore': 4,
  'command': 837},
 'opcountersRepl': {'insert': 0,
  'query': 0,
  'update': 0,
  'delete': 0,
  'getmore': 0,
  'command': 0},
 'repl': {'topologyVersion': {'processId': ObjectId('616efa4979e5c6643da3212d'),
   'counter': 6},
  'hosts': ['cluster0-shard-00-00.gshpm.mongodb.net:27017',
   'cluster0-shard-00-01.gshpm.mongodb.net:27017',
   'cluster0-shard-00-02.gshpm.mongodb.net:27017'],
  'setName': 'atlas-s4vzoi-shard-0',


Let's get the `accounts` collection.

In [18]:
accounts = db.accounts

Print the type to confirm that it is a collection: https://docs.mongodb.com/manual/core/databases-and-collections/

In [19]:
type(accounts)

pymongo.collection.Collection

Let's get the data from it.

In [20]:
result = accounts.find()

Interestingly, the result is not a collection but a cursor: https://docs.mongodb.com/manual/reference/method/db.collection.find/index.html

In [21]:
type(result)

pymongo.cursor.Cursor

`dir` will show that `result` (a `Cursor`) has a `next` method, so it can be iterated. Let's iterate and print every 100th document. Be aware that after iterating, the cursor is exhausted, so save the data in a list if you need it again.

In [22]:
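for i1, res in enumerate(result):
    # Print every 100th document as a sample
    if i1 % 100 == 0:
        print(res)
print(i1)  # index of the last document, i.e. number of documents minus one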

{'_id': ObjectId('5ca4bbc7a2dd94ee5816238c'), 'account_id': 371138, 'limit': 9000, 'products': ['Derivatives', 'InvestmentStock']}
{'_id': ObjectId('5ca4bbc7a2dd94ee581623f2'), 'account_id': 168924, 'limit': 10000, 'products': ['InvestmentFund', 'CurrencyService', 'InvestmentStock']}
{'_id': ObjectId('5ca4bbc7a2dd94ee58162457'), 'account_id': 951849, 'limit': 10000, 'products': ['Brokerage', 'InvestmentStock']}
{'_id': ObjectId('5ca4bbc7a2dd94ee581624bb'), 'account_id': 66698, 'limit': 10000, 'products': ['InvestmentStock', 'Commodity', 'Derivatives', 'CurrencyService']}
{'_id': ObjectId('5ca4bbc7a2dd94ee5816251f'), 'account_id': 136139, 'limit': 10000, 'products': ['CurrencyService', 'InvestmentFund', 'InvestmentStock']}
{'_id': ObjectId('5ca4bbc7a2dd94ee58162583'), 'account_id': 785218, 'limit': 10000, 'products': ['Commodity', 'InvestmentStock']}
{'_id': ObjectId('5ca4bbc7a2dd94ee581625e7'), 'account_id': 145588, 'limit': 10000, 'products': ['Derivatives', 'InvestmentFund', 'Investm
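
The single-pass behaviour described above is that of any Python iterator. The sketch below uses a plain generator as a stand-in for the cursor (the documents are made up and no database is involved) and shows that materialising the results with `list` keeps them available for repeated use.

```python
def fake_cursor():
    """Stand-in for accounts.find(): yields made-up documents exactly once."""
    for account_id in (101, 102, 103):
        yield {"account_id": account_id, "limit": 10000}


cursor = fake_cursor()
first_pass = list(cursor)   # consumes the iterator and stores every document
second_pass = list(cursor)  # the iterator is already exhausted

print(len(first_pass), len(second_pass))
```

In the notebook, `docs = list(accounts.find())` would apply the same idea to the real cursor.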

Now let's use MongoDB to acquire the data we need. Here we aggregate across two collections, `accounts` and `transactions`. The `$lookup` stage matches `account_id` from `accounts` (the `localField`) against `account_id` in `transactions` (the `foreignField`), similar to `JOIN ... ON` in SQL. The joined documents are stored in a new array field named `transactions_link`. We also select only those records whose `account_id` is less than 370000.

In [None]:
result = accounts.aggregate([
    {
        '$lookup': {
            'from': 'transactions', 
            'localField': 'account_id', 
            'foreignField': 'account_id', 
            'as': 'transactions_link'
        }
    }, {
        '$match': {
            '$expr': {
                '$gt': [
                    370000, '$account_id'
                ]
            }
        }
    }
])
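
To make the two pipeline stages concrete, the following sketch emulates them in plain Python on made-up documents. The helper names `lookup` and `match_below` are hypothetical and only mimic `$lookup` and the `$match` filter; they are not PyMongo functions.

```python
def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Mimic $lookup: attach all matching foreign documents as a list."""
    joined = []
    for doc in local_docs:
        matches = [
            f for f in foreign_docs
            if f.get(foreign_field) == doc.get(local_field)
        ]
        joined.append({**doc, as_field: matches})
    return joined


def match_below(docs, field, threshold):
    """Mimic the $match stage: keep documents whose field is below threshold."""
    return [d for d in docs if d[field] < threshold]


# Made-up sample data using the same field names as the collections
accounts_docs = [{"account_id": 100}, {"account_id": 400000}, {"account_id": 200}]
transactions_docs = [
    {"account_id": 100, "transactions": [{"amount": 5}, {"amount": 7}]},
    {"account_id": 400000, "transactions": [{"amount": 1}]},
]

joined = lookup(accounts_docs, transactions_docs,
                "account_id", "account_id", "transactions_link")
selected = match_below(joined, "account_id", 370000)

for doc in selected:
    print(doc["account_id"], len(doc["transactions_link"]))
```

Account 200 has no matching transactions document, so its `transactions_link` list is empty. This mirrors the left-outer-join behaviour of `$lookup`.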

Let's check the result.

In [None]:
for res in result:
    links = res["transactions_link"]  # list of joined transactions documents
    print(res["account_id"], len(links[0]["transactions"]) if links else 0)

Same story, but this time we count the results.

In [None]:
result = accounts.aggregate([
    {
        '$lookup': {
            'from': 'transactions', 
            'localField': 'account_id', 
            'foreignField': 'account_id', 
            'as': 'transactions_link'
        }
    }, {
        '$match': {
            '$expr': {
                '$gt': [
                    370000, '$account_id'
                ]
            }
        }
    }, {
        '$count': 'account_id'
    }
])

In [None]:
for res in result:
    print(res)
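
Because the count depends only on `account_id`, the join is not strictly required to obtain it. A possible alternative, sketched here as a plain pipeline definition, filters with a regular `$lt` query and then counts. The helper `count_below_pipeline` and the output field name `n_accounts` are hypothetical, and the sketch assumes every document has a numeric `account_id`.

```python
def count_below_pipeline(threshold, field="account_id", out_name="n_accounts"):
    """Build an aggregation pipeline that counts documents with field < threshold."""
    return [
        {"$match": {field: {"$lt": threshold}}},  # plain query operator, no $expr
        {"$count": out_name},                     # result document: {out_name: <count>}
    ]


pipeline = count_below_pipeline(370000)
print(pipeline)
# In the notebook this list would be passed to accounts.aggregate(pipeline)
```

Naming the output field something other than `account_id` avoids confusing the count with an actual account identifier.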

If you want, check the results by loading transactions and accounts into two separate pandas DataFrames and merging them, analogous to a SQL join.
For the NoSQL report the MongoDB query is required; using the `find` syntax is enough. You could double-check using the pandas approach, though.

**Further questions:**

- Can you select the `account_id` values whose transactions have fewer than 30 elements?
- Which account has the oldest transaction?
- Think of 1-2 more complex questions for this dataset.