## Data insertion

There are several ways to insert documents into a collection in MongoDB. In the previous chapter we already used the `insert_one` method. Here we look more closely at how it works and at the other common ways of inserting data.

**Reminder:** Have the docker container running for this session.

In [None]:
from pymongo import MongoClient

client = MongoClient(host='localhost', port=27017, username='mongo', password='mongo')

# We'll be (lazily) creating the database 'catalogue'
db = client.catalogue

### Single document insertion

`insert_one` is a collection method. It inserts the given document (i.e., a mutable mapping type, typically a dictionary) and returns an instance of `InsertOneResult`, a special type that contains information about the operation performed.

In [None]:
document = {
    'name': 'Sirius',
    'mv': -1.46,
}

# Accessing and creating the 'stars' collection
insertion = db.stars.insert_one(document)

In [None]:
print(f'Document insertion acknowledged: {insertion.acknowledged}')
print(f'\nDocument _id: {insertion.inserted_id}')

The `acknowledged` property indicates whether the insertion was acknowledged, while `inserted_id` gives the value of `_id` for the inserted document (in this case an `ObjectId` generated automatically, since the document did not provide one).

*About acknowledgement:* The database can be spread across multiple servers, creating replicas of the data. Normally, once the primary copy has been written or a majority of the replicas have been written, the operation will return and be acknowledged. It is also possible to require that the write has reached a given number of copies before it is acknowledged. It is also possible, although not recommended, to set the acknowledgment level to zero, which tells the database not to wait for confirmation of the write. This topic is not necessary to fully grasp at a user level, but more info can be found [here](https://www.mongodb.com/docs/manual/reference/write-concern/).

As mentioned before, the `_id` field can be passed explicitly:

In [None]:
document = {
    '_id': 'alf Car',
    'name': 'Canopus',
    'mv': -0.74,
}

insertion = db.stars.insert_one(document)

In [None]:
print(f'Inserted document with _id: {insertion.inserted_id}')

Note, however, that the `_id` **must** be unique:

In [None]:
document = {
    '_id': 'alf Car',
    'name': 'Canopus2',
    'mv': -0.745,
}

# Raises DuplicateKeyError because a document with this _id already exists
insertion = db.stars.insert_one(document)

### Multi-document insertion

It is also possible to insert multiple documents in a single operation, which is more efficient when a large number of existing documents has to be loaded:

In [None]:
documents = [
    {'_id': 'alf Cen A', 'mv': 0.01},
    {'_id': 'alf Lyr', 'mv': 0.03, 'name': 'Vega'},  # Note that the fields need not be consistent between documents
    {'_id': 'bet Cen', 'mv': 0.58}
]

insertions = db.stars.insert_many(documents)

In [None]:
print(f'Document insertion acknowledged: {insertions.acknowledged}')
print(f'Document _ids: {insertions.inserted_ids}')

Note that the result this time is an instance of `InsertManyResult`, and the equivalent of the `inserted_id` property is now `inserted_ids`.

To insert multiple documents, an iterable (usually a list or a tuple) of mutable mappings (usually dictionaries) must be passed to `insert_many`.

An additional option to keep in mind is the boolean `ordered` (defaults to `True`):
* If `True`, the documents are inserted in the order given, and the first failure stops the operation, so the remaining documents are not inserted
* If `False`, the order is not guaranteed, the documents may be inserted in parallel, and the server attempts to insert every document even if some of them fail

In [None]:
documents = [
    {'_id': 'alf Sco', 'mv': 0.91, 'name': 'Antares'},
    {'_id': 'alf Tau', 'mv': 0.86, 'name': 'Aldebaran'},
]

db.stars.insert_many(documents, ordered=False)
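
The difference between the two modes is easiest to see when one document in the batch fails. The following is a minimal pure-Python sketch of that behaviour, using a plain dictionary keyed by `_id` as a stand-in for the collection. The function `simulate_insert_many` and the sample values are hypothetical and only model the semantics described above; this is not how PyMongo is implemented. With a real collection, a failed `insert_many` raises `BulkWriteError` instead of returning a list of errors.

```python
def simulate_insert_many(store, documents, ordered=True):
    """Toy model of insert_many on a dict keyed by _id (not PyMongo code)."""
    inserted, errors = [], []
    for doc in documents:
        if doc['_id'] in store:
            errors.append(doc['_id'])  # Duplicate _id: this write fails
            if ordered:
                break  # Ordered: give up on everything after the first failure
            continue  # Unordered: keep trying the remaining documents
        store[doc['_id']] = doc
        inserted.append(doc['_id'])
    return inserted, errors


# Stand-in for a collection that already contains Canopus
existing = {'alf Car': {'_id': 'alf Car', 'name': 'Canopus', 'mv': -0.74}}

# Illustrative batch; the second document clashes with the existing _id
batch = [
    {'_id': 'alf Ori', 'mv': 0.42},
    {'_id': 'alf Car', 'mv': -0.74},
    {'_id': 'alf Eri', 'mv': 0.46},
]

# Each run works on its own copy so that the two modes start from the same state
ordered_result = simulate_insert_many(dict(existing), batch, ordered=True)
unordered_result = simulate_insert_many(dict(existing), batch, ordered=False)

print(f'ordered=True  -> inserted {ordered_result[0]}, failed {ordered_result[1]}')
print(f'ordered=False -> inserted {unordered_result[0]}, failed {unordered_result[1]}')
```

In the ordered run only the first document is stored, while the unordered run also stores the third one. In both cases the documents written before the failure remain in the store.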

## Basic queries

To query the collection, the methods `find` and `find_one` are available. The latter will always return the first match, even if more than one document matches the query.

An empty query is equivalent to searching for all data:

In [None]:
all_docs = db.stars.find()
all_docs

The return value is always of type `Cursor`, which is iterable. If nothing matches the query, the cursor is simply empty (it is `find_one` that returns `None` in that case):

In [None]:
for doc in all_docs:
    print(doc)

**Important:** While a `Cursor` is iterable, it is not a Python list. Although documents can be accessed through their index, every use of indexing actually runs a new query with a given `skip` and `limit` (we'll explain these concepts later). Something like the following is extremely inefficient and may even give inconsistent results, for example if the collection changes between queries:

In [None]:
# DO NOT RUN SOMETHING LIKE THIS
all_docs = db.stars.find()  # We need to run the find again since we already exhausted the iterator
for i in range(5):
    print(all_docs[i])
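
A common alternative is to consume the cursor once and keep the documents in a list, which can then be indexed freely. The sketch below uses a plain Python iterator with sample data as a stand-in for the `Cursor`, so that it runs without a database. With a live collection the equivalent would be something like `docs = list(db.stars.find(limit=5))`, keeping in mind that this loads every returned document into memory.

```python
# Stand-in for a Cursor: a single-pass iterator over sample documents
cursor_like = iter([
    {'name': 'Sirius', 'mv': -1.46},
    {'name': 'Canopus', 'mv': -0.74},
    {'name': 'Vega', 'mv': 0.03},
])

# Consume the iterator exactly once and keep the documents in memory
docs = list(cursor_like)

# Indexing and slicing now act on the list, with no further queries
first_two = docs[:2]
print(first_two)

# The iterator itself is exhausted after the single pass
remaining = list(cursor_like)
print(remaining)
```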

The method `find_one` accepts the same queries as `find`, but it returns a single document (as a dictionary) every time, unless nothing is matched, in which case it returns `None`:

In [None]:
db.stars.find_one()

Queries that match many documents can take a very long time, so the option `limit` can be used to restrict the maximum number of documents returned (this is only useful for `find`):

In [None]:
docs = db.stars.find(limit=2)
for doc in docs:
    print(doc)

Note that the default value of `limit` is zero, which means no limit is applied.

Another useful option is `skip`, which will skip the first `n` matches. Normally this option is combined with `limit` for pagination purposes:

In [None]:
docs = db.stars.find(skip=2, limit=2)
for doc in docs:
    print(doc)
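
For pagination, the `skip` and `limit` values are usually derived from a page number and a page size. A minimal sketch of that arithmetic follows; the helper `page_params` is a hypothetical name introduced here for illustration and is not part of PyMongo.

```python
def page_params(page, page_size):
    """Return skip/limit keyword arguments for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError('page and page_size must be positive integers')
    return {'skip': (page - 1) * page_size, 'limit': page_size}


for page in (1, 2, 3):
    print(page, page_params(page, page_size=2))

# With a live collection this could be used as:
# docs = db.stars.find(**page_params(2, page_size=2))
```

Pages are only stable if the query always returns documents in the same order, so in practice pagination is normally combined with an explicit sort.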

### Filters

Filters in MongoDB are also constructed as dictionaries. We'll see a few simple queries here; more complex ones are left for later sections.

For an exact match in a given field, the name of the field must be mapped to the value being searched (this applies to both `find` and `find_one`):

In [None]:
filter_by = {
    'name': 'Vega'
}

db.stars.find_one(filter_by)

More generally, the filter maps each field to a dictionary that states the type of match to apply and the value to compare against:

In [None]:
filter_by = {
    'mv': {
        '$gte': 0.5  # Greater than or equal 0.5
    }
}

docs = db.stars.find(filter_by)  # {field: {$match_type: value}}
for doc in docs:
    print(doc)

Multiple restrictions for the same field can be applied simultaneously:

In [None]:
filter_by = {
    'mv': {
        '$gte': 0.5,  # Greater than or equal 0.5
        '$lte': 0.9  # Less than or equal 0.9
    }
}

docs = db.stars.find(filter_by)
for doc in docs:
    print(doc)

Or restrictions over multiple fields simultaneously (documents must match all of them to be selected):

In [None]:
filter_by = {
    'mv': {
        '$gte': 0.5  # Greater than or equal 0.5
    },
    'name': {
        '$regex': '^A'  # Regular expression (starts with A)
    }
}

docs = db.stars.find(filter_by)
for doc in docs:
    print(doc)

The case for the exact match is actually just a convenience for the comparison with `$eq`:

In [None]:
# Both filters are equivalent
filter_by = {
    'name': 'Vega'
}
filter_by = {
    'name': {
        '$eq': 'Vega'
    }
}

db.stars.find_one(filter_by)

Given that documents do not necessarily share the same fields, it is also possible to check whether a field exists with the `$exists` operator:

In [None]:
filter_by = {
    'name': {
        '$exists': True  # Use False to match documents that lack the field
    }
}

for doc in db.stars.find(filter_by):
    print(doc)

To join multiple clauses with "or" instead of "and":

In [None]:
filter_by = {
    '$or': [  # NOTE: This is a list, not a dictionary
        {
            'name': 'Vega'
        },
        {
            'name': 'Sirius'
        }
    ]
}

for doc in db.stars.find(filter_by):
    print(doc)
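
To make the semantics of these filters concrete, here is a small pure-Python sketch that evaluates a filter against a dictionary. It is a simplified model that covers only the operators used in this section (`$eq`, `$gte`, `$lte`, `$regex`, `$exists` and `$or`) and it is not how MongoDB evaluates queries internally. The function names `matches` and `find_ids`, as well as the `_id` given to Sirius, are assumptions made for the example.

```python
import re


def matches(doc, query):
    """Return True if `doc` satisfies `query` (simplified model)."""
    for key, condition in query.items():
        if key == '$or':
            # '$or' takes a list of sub-filters; at least one must match
            if not any(matches(doc, sub) for sub in condition):
                return False
            continue
        # A plain value is shorthand for {'$eq': value}
        if not isinstance(condition, dict):
            condition = {'$eq': condition}
        present = key in doc
        value = doc.get(key)
        # Every operator listed for the field must hold ("and")
        for op, target in condition.items():
            if op == '$exists':
                ok = present == bool(target)
            elif not present:
                ok = False  # Comparisons never match a missing field here
            elif op == '$eq':
                ok = value == target
            elif op == '$gte':
                ok = value >= target
            elif op == '$lte':
                ok = value <= target
            elif op == '$regex':
                ok = isinstance(value, str) and re.search(target, value) is not None
            else:
                raise ValueError(f'Operator not covered by this sketch: {op}')
            if not ok:
                return False
    return True


# Sample data mirroring the documents inserted in this section
stars = [
    {'_id': 'alf CMa', 'name': 'Sirius', 'mv': -1.46},
    {'_id': 'alf Car', 'name': 'Canopus', 'mv': -0.74},
    {'_id': 'alf Cen A', 'mv': 0.01},
    {'_id': 'alf Lyr', 'name': 'Vega', 'mv': 0.03},
    {'_id': 'bet Cen', 'mv': 0.58},
    {'_id': 'alf Sco', 'name': 'Antares', 'mv': 0.91},
    {'_id': 'alf Tau', 'name': 'Aldebaran', 'mv': 0.86},
]


def find_ids(query):
    """Return the _id of every sample document matching the filter."""
    return [doc['_id'] for doc in stars if matches(doc, query)]


print(find_ids({'mv': {'$gte': 0.5, '$lte': 0.9}}))
print(find_ids({'mv': {'$gte': 0.5}, 'name': {'$regex': '^A'}}))
print(find_ids({'$or': [{'name': 'Vega'}, {'name': 'Sirius'}]}))
print(find_ids({'name': {'$exists': False}}))
```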

A full list of the query operators can be found [here](https://www.mongodb.com/docs/manual/reference/operator/query/).

### Sorting

The query results can also be sorted using the option `sort`, which takes a list of pairs with the field name and either `1` or `-1` for ascending or descending order, respectively:

In [None]:
sorted_docs = db.stars.find(sort=[('mv', 1)])  # Sort by mv in ascending order
for doc in sorted_docs:
    print(doc)

Multiple keys can be given and any documents that have the same value for the primary key will be ordered according to the secondary key:

In [None]:
# Sort by name in descending order; documents with the same name
# are then sorted by mv in ascending order
sorted_docs = db.stars.find(sort=[('name', -1), ('mv', 1)])
for doc in sorted_docs:
    print(doc)

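Note that documents without `name` are still present. For sorting purposes, MongoDB treats a missing field as if it were `null`, which is ordered before any string, so these documents come last when sorting by `name` in descending order.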
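
The way a secondary key breaks ties can be reproduced in plain Python with stable sorts, applied from the last key to the first. The sketch below assumes that every document contains all of the sort fields (missing fields are not modelled), and both the helper `sort_docs` and the sample data are hypothetical; this is not how MongoDB implements sorting.

```python
def sort_docs(docs, sort_spec):
    """Sort dictionaries by a list of (field, direction) pairs."""
    result = list(docs)
    # Python's sort is stable, so sorting by the least significant key
    # first leaves ties of the more significant keys correctly ordered
    for field, direction in reversed(sort_spec):
        result.sort(key=lambda doc: doc[field], reverse=(direction == -1))
    return result


# Hypothetical documents with a repeated name to show the tie-breaking
sample = [
    {'name': 'B', 'mv': 2},
    {'name': 'A', 'mv': 1},
    {'name': 'B', 'mv': 1},
]

for doc in sort_docs(sample, [('name', -1), ('mv', 1)]):
    print(doc)
```

The two documents named `B` come first (descending `name`) and are ordered by ascending `mv` among themselves.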