
# Querying Document Stores

## Overview

The ability to interact with datastores of various kinds is critical for data science and machine learning applications.  In this lab, we will experiment with a MongoDB document store, learning how to reverse engineer models to the extent needed to query data effectively.

## Goals

By the end of this lab, you should be able to:

 * Connect to a MongoDB document store
 * Perform queries
 * Understand how to determine the structure of documents
 * Create abstractions that allow you to easily interact with a MongoDB store
 
## Estimated Time: 30 - 45 minutes

Interacting with a MongoDB store using the Mongo command line interface is fairly straightforward, but it is not the most typical way that someone in our role will interact with the server.  Instead, we would have credentials that allow us to interface with the data remotely.  Additionally, since we often want to perform some transformation or analysis on the data, it makes sense to learn to interact with the database from within a programming environment that allows us to manipulate the data easily.

Ultimately, we would like to use the *mongoengine* package to talk to the database.  This requires that we understand the structure of the documents within the data store.  Determining the structure of the data requires that we use a lower-level interface to talk to the data store, at least at first.  We will use *pymongo*.

# <img src="../images/task.png" width=20 height=20> Task 4.1

The portion of the *pymongo* package that we need to use is the `MongoClient` class.  Please import this from `pymongo`.

In [None]:
from pymongo import MongoClient

Our current goal is to determine the names of the databases within the MongoDB server.  To do this, we need to establish a connection to the server.

The `MongoClient` class has a convenience method, `list_database_names()`, that can tell us about these databases.  To use it, we must first create an instance of the `MongoClient` class.

The class initializer can accept a MongoDB connection string.  These strings are typically of the form `mongodb://server:27017`.  If we already know the name of the database that we wish to interact with, this can also be passed as a part of the connection string as though it were the path in a URL.
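
As a minimal sketch of that format, the snippet below assembles a connection string and then splits it back apart using only the standard library.  The helper name `build_mongo_uri` and the host name `server` are illustrative assumptions, not part of *pymongo*.

```python
from urllib.parse import urlsplit

def build_mongo_uri(host, port=27017, database=None):
    """Hypothetical helper: assemble a MongoDB connection string."""
    uri = f"mongodb://{host}:{port}"
    if database:
        # The database name is appended like the path of a URL.
        uri += f"/{database}"
    return uri

uri = build_mongo_uri("server", database="scoreserver")
print(uri)

# Split the string back into its parts to confirm the layout.
parts = urlsplit(uri)
print(parts.hostname, parts.port, parts.path.lstrip("/"))
```

Leaving out `database` yields the shorter `mongodb://server:27017` form shown above.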

# <img src="../images/task.png" width=20 height=20> Task 4.2

Create an instance of the `MongoClient` class.  Create and pass a connection string that uses the IP address of your virtual machine running on your workstation.  Once you have successfully connected, list the names of the databases that are available.

In [None]:
vm_address = "192.168.100.129"  # This must be changed to match your VM address!

connection = MongoClient(f'mongodb://{vm_address}:27017')
connection.list_database_names()

['admin', 'config', 'local', 'scoreserver']

For our lab, the database of interest is the *scoreserver* database.  We can reference this database as though it were an attribute of the `MongoClient` instance that we have created.  For example:

```
connection = MongoClient(f'mongodb://{vm_address}:27017')
connection.scoreserver # <- This references the scoreserver database on the MongoDB server
```

Since we can reference the database in this way, it makes sense to assign this to some shorter variable for easy access.

Now that we have identified the databases present and selected the *scoreserver* database, we'd like to know about the different document collections that are available.  We can obtain this list using the `list_collection_names()` convenience function that is available on the database object.

# <img src="../images/task.png" width=20 height=20> Task 4.3

Create a variable to reference the *scoreserver* database directly.  Then use this variable to obtain a list of the collections stored in the database.

In [None]:
db = connection.scoreserver
db.list_collection_names()

['eventmodels',
 'metadatamodels',
 'svgmodels',
 'questionmodels',
 'levelmodels',
 'teammodels',
 'usermodels',
 'gamemodels',
 'sectionmodels']

We can see that there are several different collections of data available.  Let's work with the *usermodels* collection to start with.

A collection can be referenced directly as though we were accessing a dictionary.  For example:

```
connection = MongoClient(f'mongodb://{vm_address}:27017')
db = connection.scoreserver
db['usermodels'].find_one()
```

You can see above how we have referenced the `usermodels` collection as though it were a dictionary key.  For convenience, we might also assign this reference to a variable.
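
Attribute-style and dictionary-style access are two spellings of the same lookup.  The sketch below is a toy stand-in, not *pymongo*'s actual code; the class `ToyNamespace` is hypothetical and only illustrates how a Python object can answer both kinds of lookup.

```python
class ToyNamespace:
    """Toy stand-in (not pymongo): resolves child names by attribute or by key."""

    def __init__(self, label):
        self.label = label

    def __getitem__(self, name):
        # Key-style access builds a child object for the requested name.
        return ToyNamespace(f"{self.label}.{name}")

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails; reuse key access.
        if name.startswith("_"):
            raise AttributeError(name)
        return self[name]

server = ToyNamespace("server")
print(server.scoreserver.label)
print(server["scoreserver"]["usermodels"].label)
```

With the real client, `connection['scoreserver']` and `db.usermodels` are accepted in the same way.  The key form is handy when the name is held in a variable.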

# <img src="../images/task.png" width=20 height=20> Task 4.4

Create a variable `users` that can be used to access the `usermodels` collection.  Using this variable, use the `find_one()` function to retrieve one record from the database.

In [None]:
users = db['usermodels']
users.find_one()

{'_id': ObjectId('5a858954fc0efd33e1d6ceef'),
 'password': 'REMOVED',
 'name': 'jnovak',
 'games': [],
 'gameArray': ['Migrated'],
 'rights': 256,
 'sessionId': 'kZkZt9bNxRjeS0fM3bDBYGRFboxWxbrSPj8WzBrJR8WnU5e2',
 'newGameArray': [{'gameId': ObjectId('5a8418fc75b9e8f7c23809d4'),
   'sessionId': '',
   'score': 0,
   'hintsTaken': 0,
   'pointsLost': 0,
   '_id': ObjectId('5b92a2e5d39eab5a96dc63ed'),
   'gameData': {'5a8418fc75b9e8f7c23809f7': {'correct': False,
     'attempts': 0,
     'hintsTaken': 0},
    '5a8418fc75b9e8f7c23809f5': {'correct': False,
     'attempts': 0,
     'hintsTaken': 0},
    '5a8418fc75b9e8f7c23809f3': {'correct': False,
     'attempts': 0,
     'hintsTaken': 0},
    '5a8418fc75b9e8f7c23809f0': {'correct': False,
     'attempts': 0,
     'hintsTaken': 0},
    '5a8418fc75b9e8f7c23809ee': {'correct': False,
     'attempts': 0,
     'hintsTaken': 0},
    '5a8418fc75b9e8f7c23809eb': {'correct': False,
     'attempts': 0,
     'hintsTaken': 0},
    '5a8418fc75b9e8f7

Consider the structure of the JSON object that was returned.  We can see that it is organized as keys and values.  Some of those values are nested objects, and some are arrays.  We can use this structure to design a Python class that makes the data easy to work with.
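
One way to survey such a structure is to walk it recursively and list each key with the type of its value.  The sketch below does this for a plain Python dictionary; the helper name `describe` is hypothetical, and `sample` is a hand-trimmed stand-in for the record shown above rather than real query output.

```python
def describe(doc, indent=0):
    """Return one 'key: type' line per key, descending into nested data."""
    lines = []
    pad = " " * indent
    for key, value in doc.items():
        lines.append(f"{pad}{key}: {type(value).__name__}")
        if isinstance(value, dict):
            lines.extend(describe(value, indent + 2))
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            # Assume list entries share a shape; inspect only the first one.
            lines.extend(describe(value[0], indent + 2))
    return lines

# Trimmed, illustrative sample modeled on the record above.
sample = {
    "name": "jnovak",
    "rights": 256,
    "games": [],
    "newGameArray": [
        {"score": 0,
         "gameData": {"5a8418fc75b9e8f7c23809f7": {"correct": False, "attempts": 0}}}
    ],
}

print("\n".join(describe(sample)))
```

The indentation in the printed summary mirrors the nesting depth, which makes it easier to decide which top-level fields a class needs.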

Throughout this course, you will find that a great deal of our time is spent manipulating or transforming data to get it into a form that we can use for machine learning.  While we could certainly write code to work with the JSON object directly, or even convert the JSON object into a series of nested dictionaries, a little bit of work up-front can make accessing the data now and in the future much easier.

To this end, we'd like to build a class that represents the data within the `usermodels` collection.  If we create this class abstraction, we can then leverage MongoEngine to read the data in the collection into these objects and work with them much more easily.

Doing so, as stated above, requires that we define a *class*.  You can think of a class as an abstraction that collects different attributes that represent a thing.  In this case, we are talking about user objects.  What sorts of attributes do user objects have?  Usually they have usernames, passwords, and other similar attributes.

To create the class using MongoEngine, we need to import all of the functionality within `mongoengine`.  While we would typically discourage you from importing things directly into the global namespace, in this case it seems harmless since we are building more of a tool specifically for interacting with the database.

Once this has been imported, we can define various types of data.  These are the types most relevant for our current task:

* `StringField()` can store any arbitrary length string
* `IntField()` is used to store integer data; floating point values need `FloatField()` instead
* `DateTimeField()` is used to store timestamps
* `ListField()` is used to represent a collection or array of some other type of field

Since we are switching to MongoEngine, we must also connect to the database using the MongoEngine connection handler.  Everything in MongoEngine depends on this.  The format of it is:

```
connect(db_name, host=server_address, port=27017)
```

# <img src="../images/task.png" width=20 height=20> Task 4.5

Create a class named `Usermodels`.  This class should inherit from the `Document` class provided by MongoEngine.  Define sufficient fields within the `Usermodels` class to capture all of the top-level elements of the `usermodels` collection.  If you aren't sure what type to make something, `StringField()` is very forgiving.

Once you have created this class, `connect()` to the database and use your new class to retrieve the first user object in the collection.  Print out the values of the top level fields.

In [None]:
from mongoengine import *
import datetime   # We need this to parse dates

class Usermodels(Document):
    password = StringField()
    name = StringField()
    rights = IntField()
    sessionId = StringField()
    gameArray = ListField(StringField())
    newGameArray = ListField(StringField())
    games = ListField(StringField())
    updated = DateTimeField()

connect('scoreserver', host=vm_address, port=27017)
user = Usermodels.objects().first()
print(user.name, user.password, user.rights, user.sessionId, user.updated)

jnovak REMOVED 256 kZkZt9bNxRjeS0fM3bDBYGRFboxWxbrSPj8WzBrJR8WnU5e2 2018-09-07 16:10:13.026000


Within the data that we loaded, we have a pretty large and deep structure in the `newGameArray` field.  No doubt you have loaded this as a `ListField(StringField())`.  It turns out that MongoEngine has done something very useful for us.

Looking within that data, we can see that there are many instances of `ObjectId()`.  These values correspond to the `_id` field of documents in a collection.  Wouldn't it be nice if we could use one of those values to query a collection?

At first glance, it might appear that we would need to somehow parse that data to access the fields within it.  It turns out that this isn't the case.  Let's experiment.
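
As a first experiment that needs no database, it may help to see why these 24-character strings are recognizable as object IDs.  An ObjectId is 12 bytes written as 24 hexadecimal digits, and its first 4 bytes hold a creation time in seconds since the Unix epoch.  The sketch below decodes that prefix with the standard library only; the helper name `objectid_created` is hypothetical.

```python
import re
from datetime import datetime, timezone

# An ObjectId string is exactly 24 hexadecimal digits.
OBJECTID_PATTERN = re.compile(r"[0-9a-f]{24}", re.IGNORECASE)

def objectid_created(value):
    """Hypothetical helper: decode the timestamp prefix of an ObjectId string."""
    if not OBJECTID_PATTERN.fullmatch(value):
        raise ValueError(f"not a 24-digit hex string: {value!r}")
    seconds = int(value[:8], 16)  # first 4 bytes = seconds since the epoch
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

created = objectid_created("5a8418fc75b9e8f7c23809f7")
print(created.isoformat())
```

A string that passes this check is a reasonable candidate for an `_id` lookup in some collection.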

# <img src="../images/task.png" width=20 height=20> Task 4.6

The `newGameArray` data should have been loaded as a `ListField()`.  As such, it is now a Python list.  Since each record holds only one entry in this list, all of the data is available as the first element of that field.

Use the Python `type()` function to determine the type of the first element in the `newGameArray` value from the first record in the collection.

In [None]:
print(type(user.newGameArray[0]))

<class 'mongoengine.base.datastructures.BaseDict'>


Now that's interesting!  Even though we told MongoEngine to treat this as a list of strings, it has loaded the first element as a type named `BaseDict`.  That name is a strong indication of the underlying type of this data.  This is a dictionary!

This is one of the great advantages of MongoEngine.  We can avoid parsing JSON documents by leveraging its built-in conversions.

Let's see if we can use some of this data to retrieve related data from another collection.

# <img src="../images/task.png" width=20 height=20> Task 4.7

Copy the data in the `gameData` key in the first element of the `newGameArray` from the record that we have loaded into a new variable.

In [None]:
gameData = user.newGameArray[0]['gameData']
gameData

{'5a8418fc75b9e8f7c23809f7': {'correct': False,
  'attempts': 0,
  'hintsTaken': 0},
 '5a8418fc75b9e8f7c23809f5': {'correct': False,
  'attempts': 0,
  'hintsTaken': 0},
 '5a8418fc75b9e8f7c23809f3': {'correct': False,
  'attempts': 0,
  'hintsTaken': 0},
 '5a8418fc75b9e8f7c23809f0': {'correct': False,
  'attempts': 0,
  'hintsTaken': 0},
 '5a8418fc75b9e8f7c23809ee': {'correct': False,
  'attempts': 0,
  'hintsTaken': 0},
 '5a8418fc75b9e8f7c23809eb': {'correct': False,
  'attempts': 0,
  'hintsTaken': 0},
 '5a8418fc75b9e8f7c23809e8': {'correct': False,
  'attempts': 0,
  'hintsTaken': 0},
 '5a8418fc75b9e8f7c23809e7': {'correct': False,
  'attempts': 0,
  'hintsTaken': 0},
 '5a8418fc75b9e8f7c23809e4': {'correct': False,
  'attempts': 0,
  'hintsTaken': 0},
 '5a8418fc75b9e8f7c23809e2': {'correct': False,
  'attempts': 0,
  'hintsTaken': 0},
 '5a8418fc75b9e8f7c23809e0': {'correct': False,
  'attempts': 0,
  'hintsTaken': 0},
 '5a8418fc75b9e8f7c23809de': {'correct': False,
  'attempts': 0,



# <img src="../images/task.png" width=20 height=20> Task 4.8

The keys in this dictionary also appear to be object ID values.  To save just a little bit of time, we will tell you that they are from the `questionmodels` collection.  Use the next cell to build a class to represent question models and then retrieve all of the questions referenced by the data in your new variable.  What is the total value of all of the questions referenced?

***Important Note:*** When you build this class, you will almost certainly run into a problem defining the field `__v`.  This is because of the special treatment of double-underscore variables in Python.  You can get around this issue by using the `db_field='__v'` keyword parameter when defining the type for the variable that you wish to assign to this field.
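
The special treatment in question is name mangling: inside a class body, a name that starts with two underscores and does not end with two is rewritten to include the class name.  The plain-Python sketch below involves no MongoEngine, and the class `Demo` is purely illustrative.

```python
class Demo:
    __v = 1   # inside a class body this is stored as _Demo__v
    v = 2     # an ordinary name is stored unchanged

# List the class attributes whose names end in "v".
names = sorted(n for n in vars(Demo) if n.endswith("v"))
print(names)
```

Because the attribute is no longer literally called `__v`, the link to the stored field has to be stated explicitly, which is what `db_field='__v'` does.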

In [None]:
# Inspect one raw record to learn the field names
# (the result is displayed only if this runs as its own cell).
raw_questions = db['questionmodels']
raw_questions.find_one()

class Questionmodels(Document):
    typeOfAnswer = IntField()
    proof = StringField()
    value = IntField()
    numHints = IntField()
    index = IntField()
    questionMarkdown = StringField()
    title = StringField()
    hints = ListField(StringField())
    distractors = StringField()
    answer = ListField(StringField())
    metadata = ListField(StringField())
    v = IntField(db_field='__v')  # map the stored '__v' field to a safe Python name

# Look up each referenced question by its ID and add up the point values.
total = 0
for question_id in gameData.keys():
    question = Questionmodels.objects.get(id=question_id)
    total += question.value
total

30

# Conclusion

In this lab, we have formalized our understanding of MongoDB collections.  More importantly, we have learned how to interact with these document stores to discover the structure of the underlying data and create a useful Python interface to abstract away the difficulties of interacting with JSON and the sometimes challenging query structure of document stores.