In [3]:
# default_exp mgmnt.db.mongo
# author: megretson
# “Excuse all faults of grammar, punctuation, spelling and sense on the score of telegraphic haste.” - William James

# Data Management

> This component is dedicated to controlling and managing data access. We use MongoDB to store, retrieve, and update datasets related to software research practices.

In [4]:
#hide
from nbdev.showdoc import *

# Creating a local mongo instance of traceability or machine_learning data

> The following is a guide to setting up your own local mongo instance of traceability or ML data. Sample frozen databases can be found at `/ds4se/data_management/frozen_mongo_databases`

## 0. Get to know mongo
Don't know anything about mongo or NoSQL databases? A little background information goes a long way. The official mongo documentation is excellent, and provides a strong background: https://docs.mongodb.com/manual/introduction/

## 1. Download, install, and run mongo
Follow the instructions to download and install mongo on your system found at https://docs.mongodb.com/manual/administration/install-community/

#### Edition
I use the following Community Edition versions, installed via Homebrew on macOS:
```
MongoDB shell version v4.2.5
git version: 2261279b51ea13df08ae708ff278f0679c59dc32
allocator: system
modules: none
build environment:
    distarch: x86_64
    target_arch: x86_64

db version v4.2.5
git version: 2261279b51ea13df08ae708ff278f0679c59dc32
allocator: system
modules: none
build environment:
    distarch: x86_64
    target_arch: x86_64
```
This matters primarily for the import and export functionality I describe below; in general, the version of the Mongo command line tools used to export the data should also be used to import it. All frozen data I created used 4.2.5. This is primarily an issue with much older versions of mongo, so consider updating if you already had mongo kicking around on your machine.

#### Starting mongo
This is an easy step to miss, but you must actually *start* your mongodb as a service or background process. All installation guides found above also include system specific instructions for starting your database. 
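
A quick way to check whether anything is listening before you go further is to try opening a TCP connection. The sketch below uses only Python's standard library; the helper name `port_is_open` is hypothetical, and it assumes the default local port 27017 used elsewhere in this guide. An open port only shows that some process is listening there, not that it is definitely mongo.

```python
import socket


def port_is_open(host="localhost", port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_is_open()` with no arguments checks `localhost:27017`; if it returns `False`, revisit the start-up instructions for your platform.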

## 2. [Optional] Download and install mongo compass
Mongo Compass is a free GUI to create, read, update, and delete documents, collections, and databases. It can be incredibly useful for debugging, but is an optional step. Download the Community Edition Stable from here: https://www.mongodb.com/download-center/compass . I use version 1.20.5, which is currently the latest stable release.

If you download and install compass, the connection string for a local database on port 27017 is:
```
mongodb://localhost:27017
```
More on connection strings can be found here: https://docs.mongodb.com/manual/reference/connection-string/
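
For the simple single-host form shown above, the host and port can be pulled apart with Python's standard URL parser. This is only an illustrative sketch: the helper name `split_connection_string` is hypothetical, and it deliberately ignores the richer forms (credentials, multiple hosts, options, `mongodb+srv`) covered by the connection string reference.

```python
from urllib.parse import urlsplit


def split_connection_string(uri):
    """Split a single-host mongodb:// URI into (host, port)."""
    parts = urlsplit(uri)
    if parts.scheme != "mongodb":
        raise ValueError("expected a mongodb:// connection string, got: " + uri)
    return parts.hostname, parts.port


host, port = split_connection_string("mongodb://localhost:27017")
```

The resulting `host` and `port` are the same two values that are passed to `MongoClient` later in this notebook.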

## 3. Import documents into your database
Mongo comes out of the box with simple import and export tools. Further information on mongoimport can be found here, https://docs.mongodb.com/manual/reference/program/mongoimport/ , but in short mongoimport allows you to build a collection from an Extended JSON file created by mongoexport. The version of mongoimport you use should match the version of mongoexport used to create the JSON. The sample frozen databases provided in `/ds4se/data_management/frozen_mongo_databases` were created using version `4.2.5`. 
> Future note: there are limitations on import and export. mongoimport and mongoexport should not be used for full instance production backups, because they do not reliably preserve all rich BSON data types. JSON can only represent a subset of the types supported by BSON. Eventually, we may want to use mongodump and mongorestore as described here https://docs.mongodb.com/manual/reference/program/mongodump/ instead. Here, I'm using the import and export features because the schemas for both the traceability and ML datasets are in Extended JSON, so they do not use the rich BSON types that require dump/restore.
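
To make the point about type coverage concrete, here is a minimal sketch using only Python's standard library. It illustrates the idea and is not what mongoexport does internally: plain `json` cannot serialise a `datetime`, while relaxed Extended JSON represents one with a `$date` wrapper. The helper name `to_relaxed_extended_json` and the sample document are hypothetical.

```python
import json
from datetime import datetime, timezone


def to_relaxed_extended_json(value):
    """Fallback encoder: wrap datetimes in a {"$date": ...} object (illustration only)."""
    if isinstance(value, datetime):
        iso = value.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")
        return {"$date": iso}
    raise TypeError(type(value).__name__ + " is not JSON serializable")


document = {"name": "sample.java", "created": datetime(2020, 4, 19, tzinfo=timezone.utc)}

# Plain JSON has no date type, so the standard encoder rejects the document
try:
    json.dumps(document)
    plain_json_failed = False
except TypeError:
    plain_json_failed = True

# With the fallback encoder the date survives as a tagged object
encoded = json.dumps(document, default=to_relaxed_extended_json)
```

Types with no such wrapper in your export would be lost or flattened, which is why dump/restore is the safer route for full backups.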

mongoimport is a package component of MongoDB, meaning that if you've installed mongo you have also installed mongoimport and mongoexport. Run the following from the `/ds4se/data_management/frozen_mongo_databases` directory to import `source_raw_20200419_1.json` into a collection called `source_raw_imported` in the database `test`:
```
mongoimport --collection=source_raw_imported --db=test --file=source_raw_20200419_1.json
```
> Related: to create your own frozen databases, the command is very similar: `mongoexport --collection=source_raw --db=test --out=source_raw_20200419_1.json`
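
If you would rather drive the import from Python, the mongoimport command above can be assembled as an argument list for `subprocess`. This is a sketch under assumptions: the helper names are hypothetical, only the three flags shown above are used, and `mongoimport` is assumed to be on your `PATH`.

```python
import shutil
import subprocess


def build_mongoimport_command(db, collection, json_file):
    """Build the mongoimport argument list for one frozen JSON file."""
    return [
        "mongoimport",
        "--collection=" + collection,
        "--db=" + db,
        "--file=" + json_file,
    ]


def run_mongoimport(command):
    """Run the command if the mongoimport binary can be found; raise otherwise."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("mongoimport was not found on PATH")
    return subprocess.run(command, check=True)


command = build_mongoimport_command("test", "source_raw_imported", "source_raw_20200419_1.json")
```

Nothing is executed until you call `run_mongoimport(command)` yourself, from the directory that holds the JSON file.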


# Accessing your local mongo instance
> All access to your mongo database should go through this SemeruCollection object. The SemeruCollection object allows for safe access to your documents and includes information on running transformations on your data, including making corpuses. If you always access your data through the SemeruCollection object, it maintains a record of when and how a document was transformed. If the state of your database is edited elsewhere, the SemeruCollection will have undefined behavior, and will likely fail.

## class SemeruCollection(Collection)

SemeruCollection objects inherit from PyMongo Collection objects, documentation here: https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection . All functions possible for a pymongo collection object are possible for a SemeruCollection object. SemeruCollections have custom definitions for the following methods: 
* `def insert_one(self, document, bypass_document_validation=False, session=None)`
* `def delete_one(self, filter, collation=None, session=None)`
* `def delete_many(self, filter, collation=None, session=None)`

and have the additional methods:

* `def run_transformation_one_to_one(self, query, function, transformation_collection_name)`
* `def run_transformation_many_to_one(self, query, function, transformation_collection_name)`
* and the static method `def link_ground_truth(id_1, collection_1, id_2, collection_2)`
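
The transformation methods take an ordinary Python function. According to the docstrings of `run_transformation_one_to_one` and `run_transformation_many_to_one` in the class below, the one-to-one variant calls the function with one document's contents as a string, and the many-to-one variant calls it with a list of strings; both expect a string back. A minimal sketch of two such functions follows; the names `lowercase_contents` and `merge_contents` are hypothetical examples, not part of the library.

```python
def lowercase_contents(file_contents):
    """One-to-one style: takes one document's contents, returns the transformed contents."""
    return file_contents.lower()


def merge_contents(file_contents):
    """Many-to-one style: takes a list of contents, returns a single combined string."""
    return "\n".join(file_contents)


one_to_one_result = lowercase_contents("Use case name VISUALIZZASCHEDASITO")
many_to_one_result = merge_contents(["first file", "second file"])
```

Because the function's `__name__` is recorded in the transformation identifier, a named function is more informative here than a lambda.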

In [87]:
# export
from pymongo import MongoClient
from pymongo.collection import Collection
from jsonschema import validate
from jsonschema import exceptions as json_exceptions
from json import loads
import warnings

class SemeruCollection(Collection):
    """ 
    Get / create a Mongo collection, with overridden insertion and deletion rules.  
    
    Overridden from PyMongo class object: documentation can be found here 
    https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection

    
    :Attributes:
        - `full_name`: The full name of this :class:`Collection`. The full name is of the form 
        `database_name.collection_name`.
        - `name`: The name of this :class:`Collection`.
        - `database`: The :class:`~pymongo.database.Database` that this :class:`Collection` is a part of.
        - `raw_schema`: The json schema for raw documents inserted into this collection, stored as dict
        - `transform_schema`: The json schema for transformed documents inserted into this collection, stored as dict

    """

    def __init__(self, database, name, raw_schema, transform_schema, create=False, codec_options=None,
                 read_preference=None, write_concern=None, read_concern=None, session=None, **kwargs):
        """
        Get / create a Semeru Mongo collection
        ...

        :Args:
          - `database`: the database to get a collection from
          - `name`: the name of the collection to get
          - `raw_schema`: the path to a json schema for raw documents
          - `transform_schema`: the path to a json schema for transformed documents
          - `create` (optional): if ``True``, force collection
            creation even without options being set
          - `codec_options` (optional): An instance of
            :class:`~bson.codec_options.CodecOptions`. If ``None`` (the
            default) database.codec_options is used.
          - `read_preference` (optional): The read preference to use. If
            ``None`` (the default) database.read_preference is used.
          - `write_concern` (optional): An instance of
            :class:`~pymongo.write_concern.WriteConcern`. If ``None`` (the
            default) database.write_concern is used.
          - `read_concern` (optional): An instance of
            :class:`~pymongo.read_concern.ReadConcern`. If ``None`` (the
            default) database.read_concern is used.
          - `collation` (optional): An instance of
            :class:`~pymongo.collation.Collation`. If a collation is provided,
            it will be passed to the create collection command. This option is
            only supported on MongoDB 3.4 and above.
          - `session` (optional): a
            :class:`~pymongo.client_session.ClientSession` that is used with
            the create collection command
          - `**kwargs` (optional): additional keyword arguments will
            be passed as options for the create collection command
        """
        
        super(SemeruCollection, self).__init__(database, name, create, codec_options, read_preference,
                                               write_concern, read_concern, session, **kwargs)
        
        self._raw_schema_path = raw_schema
        self._transform_schema_path = transform_schema

        with open(raw_schema) as raw:
            self.raw_schema = loads(raw.read())

        with open(transform_schema) as transform:
            self.transform_schema = loads(transform.read())

    def insert_one(self, document, bypass_document_validation=False,
                   session=None):
        """Insert a single document, enforcing schema rules.

        :Parameters:
          - `document`: The document to insert. Must be a Python Dict. If the document does 
          not have an _id field one will be added automatically.
          - `bypass_document_validation`: (optional) If ``True``, allows the
            write to opt-out of document level validation. Default is
            ``False``. All SemeruCollection documents are validated regardless of this param. 
          - `session` (optional): a
            :class:`~pymongo.client_session.ClientSession`.


        :Returns:
          The result document.
        """

        # validate the document
        document_type = self.__validate_document(document)

        # if the document is raw, then insert it into the database
        if document_type == "raw":
            return self.__insert_raw_document(document, bypass_document_validation, session)
        else:
            return self.__insert_transform_document(document, bypass_document_validation, session)

    def __insert_raw_document(self, document, bypass_document_validation=False, session=None):
        """ Internal insertion helper for raw documents
        
        This method turns on warnings regardless of user preferences, and raises a warning if a document is inserted 
        which references documents (either by ID or system/name pair) that are not found in the database. 
        
        :Parameters:
          - `document`: The document to insert. Must be a mutable mapping
            type. If the document does not have an _id field one will be
            added automatically.
          - `bypass_document_validation`: (optional) If ``True``, allows the
            write to opt-out of document level validation. Default is
            ``False``. All SemeruCollection documents are validated regardless of this param. 
          - `session` (optional): a
            :class:`~pymongo.client_session.ClientSession`.
            
        :Returns:
          The result document.
        """

        db = self.database
        associated_files = document["ground_truth"]

        for file in associated_files:
            # print(file)

            with warnings.catch_warnings():

                # Turn on warnings for this section of code, regardless of user preferences
                warnings.simplefilter('always')

                # Default to using a document_id, check if referenced document can be found 
                if "document_id" in file.keys():
                    if db[file["collection"]].find_one({"_id": file["document_id"]}) is None:
                        warnings.warn("Document references file with document id \'{}\', which cannot be found in "
                                      "collection \'{}\'. \n Please add related document to collection \'{}\'".format(
                                        file["document_id"],
                                        file["collection"],
                                        file["collection"]))

                # But use the name/system pair if necessary
                elif "name_and_system" in file.keys():
                    if db[file["collection"]].find_one({"name": file["name_and_system"][0],
                                                        "system": file["name_and_system"][1]}) is None:
                        warnings.warn(
                            "Document references file with system \'{}\' and name \'{}\', which cannot be found "
                            "in collection \'{}\'. \n Please add related document to collection \'{}\'".format(
                                file["name_and_system"][1],
                                file["name_and_system"][0],
                                file["collection"],
                                file["collection"]))

        raw_doc = super().insert_one(document, bypass_document_validation, session)
        return raw_doc

    def __insert_transform_document(self, document, bypass_document_validation, session):
        """ Internal insertion helper for transform documents
        
        This method raises an error if the document this document is transformed from cannot be found. 
        
        :Parameters:
          - `document`: The document to insert. Must be a mutable mapping
            type. If the document does not have an _id field one will be
            added automatically.
          - `bypass_document_validation`: (optional) If ``True``, allows the
            write to opt-out of document level validation. Default is
            ``False``. All SemeruCollection documents are validated regardless of this param. 
          - `session` (optional): a
            :class:`~pymongo.client_session.ClientSession`.
            
        :Returns:
          The result document.
        """
        
        db = self.database

        # The document needs to have a valid transformation history to be inserted
        if len(document["transformed_from"]) == 0:
            raise Exception("Transformation Document has no \"transformed_from\" field.")

        # The transformed_from fields must map to valid documents in the database
        for transformed_from in document["transformed_from"]:
            transformed_from_collection = db[transformed_from["collection"]]
            if transformed_from_collection.find_one({"_id": transformed_from["document_id"]}) is None:
                raise Exception("Cannot locate document with id {} in collection \'{}\'".format(
                    transformed_from["document_id"], transformed_from["collection"]))

        transformation_identifier = document["transformation_identifier"]
        transform_doc = super().insert_one(document, bypass_document_validation, session)
        transform_id = transform_doc.inserted_id

        # Add the new transformed document to the applied transformation fields of the transformed_from documents
        for transformed_from in document["transformed_from"]:
            transformed_from_collection = db[transformed_from["collection"]]

            update_applied_transform = {"$addToSet": {"applied_transformations": {
                "collection": self.name,
                "document_id": transform_id,
                "transformation_identifier": transformation_identifier}}}

            transformed_from_collection.update_one({"_id": transformed_from["document_id"]}, update_applied_transform)

        return transform_doc

    def __validate_document(self, document):
        """ Internal document validation helper, uses Object's raw and transform schema attributes
        
        :Parameters:
          - `document`: The document to insert. Must be a mutable mapping
            type. If the document does not have an _id field one will be
            added automatically.
            
        :Returns:
          The document type: ["transform", "raw"]
        """

        # try to validate against raw schema
        try:
            validate(document, self.raw_schema)

            # if it passes, set type to raw and validate the collection name, so that it is of style XXXX_raw
            document_type = "raw"
            if self.full_name.split("_")[-1] != "raw":
                raise TypeError("Document validates against raw_schema, but current collection is not a raw collection")

        except json_exceptions.ValidationError as raw_error:

            # if it fails, try to validate against transform schema
            try:
                validate(document, self.transform_schema)

                # if it passes, set type to transform and validate the collection name, so that it is of
                # style XXX_transform
                document_type = "transform"
                if self.full_name.split("_")[-1] != "transform":
                    raise TypeError("Document validates against transform_schema, but current collection is not a "
                                    "transform collection")

            except json_exceptions.ValidationError as transform_error:

                # if it fails, throw an unrecognized type error, and say it couldn't be validated against either schema.
                raise TypeError("Document does not validate against raw or transform schema. ",
                                raw_error, transform_error)
        return document_type

    def delete_one(self, filter, collation=None, session=None):
        """Delete a single document matching the filter.  Will not delete documents with children. 

          >>> db.test.count_documents({'x': 1})
          3
          >>> result = db.test.delete_one({'x': 1})
          >>> result.deleted_count
          1
          >>> db.test.count_documents({'x': 1})
          2

        :Parameters:
          - `filter`: A query that matches the document to delete.
          - `collation` (optional): An instance of
            :class:`~pymongo.collation.Collation`. This option is only supported
            on MongoDB 3.4 and above.
          - `session` (optional): a
            :class:`~pymongo.client_session.ClientSession`.

        :Returns:
          - An instance of :class:`~pymongo.results.DeleteResult`.
        """
        document = self.find_one(filter)

        # A document with applied transformations (children) must not be deleted.
        # A missing document or a missing/empty field falls through to the normal delete.
        if document is not None and len(document.get("applied_transformations", [])) != 0:
            raise TypeError("Document has applied transformations (children) and cannot be deleted.")

        return super().delete_one(filter, collation, session)
    
    def delete_many(self, filter, collation=None, session=None):
        """Delete one or more documents matching the filter. Will not delete documents with children. 

          >>> db.test.count_documents({'x': 1})
          3
          >>> result = db.test.delete_many({'x': 1})
          >>> result.deleted_count
          3
          >>> db.test.count_documents({'x': 1})
          0

        :Parameters:
          - `filter`: A query that matches the documents to delete.
          - `collation` (optional): An instance of
            :class:`~pymongo.collation.Collation`. This option is only supported
            on MongoDB 3.4 and above.
          - `session` (optional): a
            :class:`~pymongo.client_session.ClientSession`.

        :Returns:
          - An instance of :class:`~pymongo.results.DeleteResult`.

        """
        # Refuse to delete anything if any matching document has children
        for document in self.find(filter):
            if len(document.get("applied_transformations", [])) != 0:
                raise TypeError("Document {} has applied transformations (children) and cannot be "
                                "deleted.".format(document["_id"]))

        return super().delete_many(filter, collation, session)

    def run_transformation_one_to_one(self, query, function, transformation_collection_name):
        """ Run a transformation on a set of documents
        
        :Parameters:
          - `query`: A query to select a subset of documents on which to run the transformation. EX.
          {} selects all documents in the collection, {"system": "Albergate"} selects documents with system "Albergate".
          Documentation on this query string can be found here: https://docs.mongodb.com/manual/tutorial/query-documents/
          - `function`: A function to run on selected documents. The function must take file_contents as a string, and 
          return file_contents as a string. 
          - `transformation_collection_name`: The name of the collection to place the new transformed documents in

                     
        :Returns:
          The transformed documents as a list.
        """
        
        import datetime
        transformation_identifier = {"function_name": function.__name__, "timestamp": datetime.datetime.now()}
        transformed_documents = list()

        # Takes a function, and runs the transformation on each document which matches the query
        for document in self.find(query):
            transformed_text = function(document["contents"])

            # construct new transformed version of the document with transformed from field
            transform_document = {"name": document["name"],
                                  "system": document["system"],
                                  "applied_transformations": [],
                                  "contents": transformed_text,
                                  "transformation_identifier": transformation_identifier,
                                  "transformed_from": [{"collection": self.name,
                                                        "document_id": document["_id"]}]
                                  }

            # insert the new transformed document
            db = self.database
            transformation_collection = SemeruCollection(database=db, name=transformation_collection_name, 
                                                         raw_schema=self._raw_schema_path,
                                                         transform_schema=self._transform_schema_path)
            transform_doc_id = transformation_collection.insert_one(transform_document).inserted_id
            transformed_documents.append(transform_document)
            
            # update the transformed_from document with the applied transformation
            update_query = {"$push": {"applied_transformations": {"collection": transformation_collection.name,
                                                                     "transformation_identifier": transformation_identifier,
                                                                     "document_id": transform_doc_id}}}
            result = self.update_one({'_id': document['_id']}, update_query)
        
        return transformed_documents
    
    def run_transformation_many_to_one(self, query, function, transformation_collection_name):
        """ Run a transformation on a set of documents, creating one document 
        
        :Parameters:
          - `query`: A query to select a subset of documents on which to run the transformation. EX.
          {} selects all documents in the collection, {"system": "Albergate"} selects documents with system "Albergate".
          Documentation on this query string can be found here: https://docs.mongodb.com/manual/tutorial/query-documents/
          - `function`: A function to run on selected documents. The function must take file_contents as a list of 
          strings, and return file_contents as a string. 
          - `transformation_collection_name`: The name of the collection to place the new transformed document in

                     
        :Returns:
          The transformed document
        """
        import datetime
        transformation_identifier = {"function_name": function.__name__, "timestamp": datetime.datetime.now()}
        document_contents = list()
        document_ids = list()

        # Collect the contents and ids of every document which matches the query
        for document in self.find(query):
            document_contents.append(document["contents"])
            document_ids.append(document["_id"])

        new_transform_contents = function(document_contents)

        # construct the new transformed document, recording every source document in transformed_from
        # (name and system are taken from the last document matched by the query)
        transform_document = {"name": document["name"],
                              "system": document["system"],
                              "applied_transformations": [],
                              "contents": new_transform_contents,
                              "transformation_identifier": transformation_identifier,
                              "transformed_from": [{"collection": self.name, "document_id": document_id}
                                                   for document_id in document_ids]
                              }

        # insert the new transformed document
        db = self.database
        transformation_collection = SemeruCollection(database=db, name=transformation_collection_name, 
                                                     raw_schema=self._raw_schema_path,
                                                     transform_schema=self._transform_schema_path)
        transform_doc_id = transformation_collection.insert_one(transform_document).inserted_id
        
        # update the transformed_from documents with the applied transformation
        for id in document_ids:

            update_query = {"$push": {"applied_transformations": {"collection": transformation_collection.name,
                                                                     "transformation_identifier": transformation_identifier,
                                                                     "query": query}}}
            result = self.update_one({'_id': id}, update_query)
        
        return transform_document

    @staticmethod
    def link_ground_truth(id_1, collection_1, id_2, collection_2):
        """ Static helper to link ground truth documents. Primarily used for traceability datasets
                
        :Parameters:
          - `id_1`: Mongo Document_ID string of first document
          - `collection_1`: Collection object where first document is located
          - `id_2`: Mongo Document_ID string of second document
          - `collection_2`: Collection object where second document is located

        """

        query_1 = {"_id": id_1}
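        # store links in the same {"collection", "document_id"} shape that insert_one checks
        new_query_1_value = {"$addToSet": {"ground_truth": {"collection": collection_2.name, "document_id": id_2}}}

        query_2 = {"_id": id_2}
        new_query_2_value = {"$addToSet": {"ground_truth": {"collection": collection_1.name, "document_id": id_1}}}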

        collection_1.update_one(query_1, new_query_1_value)
        collection_2.update_one(query_2, new_query_2_value)

## Usage of SemeruCollection()
### Necessary Imports

To run the SemeruCollection code, you will also need the pymongo MongoClient object. Documentation on MongoClients can be found here, https://api.mongodb.com/python/current/api/pymongo/mongo_client.html#pymongo.mongo_client.MongoClient , but in general a MongoClient lets you connect to a Mongo instance and to all databases on that instance.

In [88]:
from pymongo import MongoClient
# from Semeru_Collection import SemeruCollection

### Instantiate a SemeruCollection object
To run this notebook yourself, you may need to change the database in the line `db = client.test`

In [89]:
import pprint

client = MongoClient('localhost', 27017)
db = client.test
test_collection = SemeruCollection(database=db, name="example_requirement_raw", raw_schema="./DB_Schema/raw_schema.json",
                        transform_schema="./DB_Schema/transformed_schema.json")


### CRUD operations
#### Creates
The following is an example of creating a single document and inserting it. Running this example may well raise a warning for you. 

The example document being inserted references another file: a source code file named "sample.java" that this requirement, "test.TXT", is linked to. When you insert a document whose ground truth references documents that are not currently in the database, the insert raises an ignorable warning. This is the desired behaviour. Say you are inserting a new set of associated test cases, requirements, and source code, in that order. Inserting the test cases and requirements raises warnings, because the associated files are not in the database yet. Inserting the source code raises no warnings, because its associated files have already been inserted. This lets you check that every file was inserted properly: if any source code does raise a document reference warning, a test case or requirement that should have been inserted was not. 

In [90]:
sample = {'name': 'test.TXT', 
          'system': 'test_system', 
          'applied_transformations': [], 
          'ground_truth': [{"collection": "example_source_raw",
                            # [name, system], in the order insert_one reads them
                            "name_and_system": ["sample.java", "eTour"]
                          }],
          'contents': 'Use case name VISUALIZZASCHEDASITO \nView the details of a particular site. \nPartecipating '
                      '\nActor initialized by Tourist \nEntry \nconditions \x95 The Tourist has successfully '
                      'authenticated to the system and is located in one of the following areas: Research Results, '
                      'List of Sites Visited Sites and List of Favorites \nFlow of events User System \n1. Select the '
                      'function for displaying the card on a site chosen. \n2 Upload data from the database. \nExit '
                      'conditions \x95 The system displays the details of the selected site. \n\x95 Interruption of '
                      'the connection to the server ETOUR. \nQuality \nrequirements'}

sample_id = test_collection.insert_one(sample).inserted_id
print("Inserted_id of the document is " + str(sample_id))

Inserted_id of the document is 5e9e01afe895708fb7e3a33d


 Please add related document to collection 'example_source_raw'
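
The warning above comes from the ground truth check performed on raw documents. As a rough sketch of that check (the helper name `ground_truth_lookup` is hypothetical, and no database is needed to run it), each `ground_truth` entry is turned into a collection name and a query: a `document_id` is preferred, and otherwise the `name_and_system` pair is read as name first, system second.

```python
def ground_truth_lookup(entry):
    """Return (collection_name, query) for one ground_truth entry."""
    if "document_id" in entry:
        return entry["collection"], {"_id": entry["document_id"]}
    if "name_and_system" in entry:
        # index 0 is the file name, index 1 is the system
        name, system = entry["name_and_system"]
        return entry["collection"], {"name": name, "system": system}
    # with neither key present, the class performs no lookup at all
    return entry["collection"], None


entry = {"collection": "example_source_raw", "name_and_system": ["sample.java", "eTour"]}
collection_name, query = ground_truth_lookup(entry)
```

In the class itself, the query is passed to `find_one` on the named collection, and a warning is raised when nothing comes back.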


#### Reads
Reads are the simplest aspect of using the Semeru Collection, and they operate no differently from any pymongo read. You can get documents in a few different ways. Reads rely on queries: for more information on query documents beyond these basic examples, refer to the mongodb documentation here: https://docs.mongodb.com/manual/tutorial/query-documents/

##### Get a single document using find_one()
This returns a single document matching a query. If this test fails, please run the create code above. 

In [91]:
any_document = test_collection.find_one({"system":"test_system"})
pprint.pprint(any_document)

{'_id': ObjectId('5e90cd2075b6ed838f7dbfc8'),
 'applied_transformations': [{'collection': 'test.new_translation_collection_transform',
                              'document_id': ObjectId('5e9dfcede895708fb7e3a2f1'),
                              'transformation_identifier': {'function_name': 'translate_text',
                                                            'timestamp': datetime.datetime(2020, 4, 20, 15, 50, 4, 905000)}},
                             {'collection': 'new_translation_collection_transform',
                              'document_id': ObjectId('5e9dfcede895708fb7e3a2f1'),
                              'transformation_identifier': {'function_name': 'translate_text',
                                                            'timestamp': datetime.datetime(2020, 4, 20, 15, 50, 4, 905000)}},
                             {'collection': 'test.new_translation_collection_transform',
                              'document_id': ObjectId('5e9dfdb9e895708fb7e3a2ff'),
 

##### Get a specific document by id
You may also want to get a specific document, rather than just any document that matches your query. Here we will find the document we inserted above. 

In [92]:
specific_document = test_collection.find_one({"_id": sample_id})
pprint.pprint(specific_document)

{'_id': ObjectId('5e9e01afe895708fb7e3a33d'),
 'applied_transformations': [],
 'contents': 'Use case name VISUALIZZASCHEDASITO \n'
             'View the details of a particular site. \n'
             'Partecipating \n'
             'Actor initialized by Tourist \n'
             'Entry \n'
             'conditions \x95 The Tourist has successfully authenticated to '
             'the system and is located in one of the following areas: '
             'Research Results, List of Sites Visited Sites and List of '
             'Favorites \n'
             'Flow of events User System \n'
             '1. Select the function for displaying the card on a site '
             'chosen. \n'
             '2 Upload data from the database. \n'
             'Exit conditions \x95 The system displays the details of the '
             'selected site. \n'
             '\x95 Interruption of the connection to the server ETOUR. \n'
             'Quality \n'
             'requirements',
 'ground_truth': [{'co

##### Get multiple documents
You may wish to return a whole batch of documents. In that case, use the `.find()` function, shown here. Note that `find()` returns a cursor object, not a document, so you iterate over it to get the documents themselves.
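
Because a cursor is consumed as you iterate over it, it can help to copy the results into a list when you need them more than once. The following is a minimal sketch of that idea, not part of the Semeru Collection API: `ground_truth_query`, `fetch_all`, and `OneShotCollection` are hypothetical names, and the stand-in class ignores the query and simply hands back a one-shot iterator so the snippet runs without a database.

```python
def ground_truth_query(system):
    """Query for documents of `system` whose ground_truth array has at least one element."""
    return {"system": system, "ground_truth.0": {"$exists": True}}


def fetch_all(collection, query):
    """Materialise the cursor returned by find() into a list so it can be reused."""
    return list(collection.find(query))


class OneShotCollection:
    """Hypothetical stand-in whose find() returns a one-shot iterator, like a cursor."""

    def __init__(self, documents):
        self._documents = list(documents)

    def find(self, query):
        # The query is ignored here; a real collection would filter on it.
        return iter(self._documents)


demo = OneShotCollection([{"name": "UC58.TXT"}, {"name": "UC59.TXT"}])
query = ground_truth_query("test_system")

cursor = demo.find(query)
first_pass = [doc["name"] for doc in cursor]
second_pass = [doc["name"] for doc in cursor]  # the iterator is already exhausted

documents = fetch_all(demo, query)  # a plain list can be looped over repeatedly
```

With pymongo you would call `fetch_all(test_collection, query)`. For large result sets, iterating the cursor directly avoids loading every document into memory at once.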

In [93]:
# this query is looking for documents in the test_system, with elements in the ground_truth array
system_query = {"system": "test_system", "ground_truth.0": { "$exists": True }}
documents_matching_query = test_collection.find(system_query)
for doc in documents_matching_query:
    pprint.pprint(doc)

{'_id': ObjectId('5e90ce3b75b6ed838f7dbfca'),
 'applied_transformations': [{'collection': 'test.new_translation_collection_transform',
                              'document_id': ObjectId('5e9dfcede895708fb7e3a2f2'),
                              'transformation_identifier': {'function_name': 'translate_text',
                                                            'timestamp': datetime.datetime(2020, 4, 20, 15, 50, 4, 905000)}},
                             {'collection': 'new_translation_collection_transform',
                              'document_id': ObjectId('5e9dfcede895708fb7e3a2f2'),
                              'transformation_identifier': {'function_name': 'translate_text',
                                                            'timestamp': datetime.datetime(2020, 4, 20, 15, 50, 4, 905000)}},
                             {'collection': 'test.new_translation_collection_transform',
                              'document_id': ObjectId('5e9dfdbae895708fb7e3a300'),
 

                              'transformation_identifier': {'function_name': 'translate_text',
                                                            'timestamp': datetime.datetime(2020, 4, 20, 15, 56, 53, 27000)}},
                             {'collection': 'new_translation_collection_transform',
                              'document_id': ObjectId('5e9dfe89e895708fb7e3a31a'),
                              'transformation_identifier': {'function_name': 'translate_text',
                                                            'timestamp': datetime.datetime(2020, 4, 20, 15, 56, 53, 27000)}},
                             {'collection': 'new_translation_collection_transform',
                              'document_id': ObjectId('5e9e00b3e895708fb7e3a32a'),
                              'transformation_identifier': {'function_name': 'translate_text',
                                                            'timestamp': datetime.datetime(2020, 4, 20, 16, 6, 5, 58000)}},
     

#### Transforms
There are two types of transforms: one to one transformations and one to many transformations. To run either, you must define your transformation function according to certain parameters.
##### One to one transformation
Here I'm going to show a basic example of a one to one transformation used to translate text from one language to another. 
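
If you want to try the mechanism without a network call, any function that takes a string and returns a string should fit the same shape as the translation function. Below is a minimal sketch of such a function. `normalize_whitespace` is a hypothetical example and not part of the library, and the commented-out call simply mirrors the keyword arguments used in this notebook.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces and tabs, and trim each line. String in, string out."""
    collapsed = re.sub(r"[ \t]+", " ", text)
    lines = [line.strip() for line in collapsed.split("\n")]
    return "\n".join(lines).strip()


sample_text = "Use case name  VISUALIZZASCHEDASITO \nView the details of a particular site. \n"
cleaned = normalize_whitespace(sample_text)

# Assumed usage, mirroring the call shown in this notebook; the collection name
# "whitespace_transform" is a made-up example:
# test_collection.run_transformation_one_to_one(
#     query={},
#     function=normalize_whitespace,
#     transformation_collection_name="whitespace_transform")
```

Checking the function on a plain string first makes it easier to tell a bug in the transformation from a problem with the database.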

In [94]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install googletrans
from googletrans import Translator

# The function you pass to run_transformation_one_to_one must take a string as its only parameter
# and return a string, which will become the new file contents. 
def translate_text(text):
    translator = Translator()
    translated_text = translator.translate(text).text
    return translated_text

transformed_docs = test_collection.run_transformation_one_to_one(query = {}, 
                                                                 function = translate_text, 
                                                                 transformation_collection_name = "new_translation_collection_transform")
for doc in transformed_docs:
    pprint.pprint(doc)

{'_id': ObjectId('5e9e01b5e895708fb7e3a33e'),
 'applied_transformations': [],
 'contents': 'Use case name VISUALIZZASCHEDASITO \n'
             'View the details of a particular site. \n'
             'Partecipating \n'
             'Actor initialized by Tourist \n'
             'Entry \n'
             'conditions \x95 The Tourist has successfully authenticated to '
             'the system and is located in one of the following areas: '
             'Research Results, List of Sites Visited Sites and List of '
             'Favorites \n'
             'Flow of events User System \n'
             '1. Select the function for displaying the card on a site '
             'chosen. \n'
             '2 Upload data from the database. \n'
             'Exit conditions \x95 The system displays the details of the '
             'selected site. \n'
             '\x95 Interruption of the connection to the server ETOUR. \n'
             'Quality \n'
             'requirements',
 'name': 'UC58.TXT',
 

##### Many to one transformation
Implementation pending
#### Updates
Implementation pending
#### Deletes
Implementation pending

In [11]:
from nbdev.export import *
notebook2script()

Converted 00_mgmnt.prep.i.ipynb.
Converted 01_exp.i.ipynb.
Converted 02_mgmnt.db.mongo.ipynb.
Converted 03_repr.i.ipynb.
Converted 04_mining.ir.model.ipynb.
Converted 05_mining.ir.i.ipynb.
Converted 06_benchmark.traceability.ipynb.
Converted 07_repr.roberta.train.ipynb.
Converted 08_exp.info.ipynb.
Converted 09_desc.stats.ipynb.
Converted 10_vis.ipynb.
Converted 11_mgmnt.prep.conv.ipynb.
Converted 12_repr.roberta.eval.ipynb.
Converted 14_mgmnt.prep.bpe.ipynb.
Converted 15_desc.metrics.se.ipynb.
Converted 16_repr.word2vec.train.ipynb.
Converted 17_repr.doc2vec.train.ipynb.
Converted 18_repr.doc2vec.eval.ipynb.
Converted 19_repr.word2vec.eval.ipynb.
Converted 20_benchmark.codegen.ipynb.
Converted 21_inf.i.ipynb.
Converted 22_inf.bayesian.ipynb.
Converted 23_inf.causal.ipynb.
Converted 24_mgmnt.corpus.ipynb.
Converted aa_blog.example.ipynb.
Converted ab_templates.example.ipynb.
Converted ac_emp.eval.pp1.rq1.ipynb.
Converted ad_emp.eval.pp1.rq2.ipynb.
Converted ae_emp.eval.pp1.rq3.ipynb.
C