# Lecture 9: MQL 1

Gittu George, February 3, 2022

_Attribution: This notebook is developed using materials from DSCI 513 by Arman Seyed-Ahmadi._

In [23]:
from pymongo import MongoClient
import json
import urllib.parse

with open('credentials_mongodb.json') as f:
    login = json.load(f)

username = login['username']
password = urllib.parse.quote(login['password'])
host = login['host']
url = "mongodb+srv://{}:{}@{}/?retryWrites=true&w=majority".format(username, password, host)
client = MongoClient(url)

## MongoDB query language (MQL)

<img src="img/nosql.png" width="400">

([image source](https://dataedo.com/cartoon/it-is-nosql))

```{admonition} See also ...
SQL to MongoDB mapping chart: https://docs.mongodb.com/manual/reference/sql-comparison/
```

As mentioned earlier, there is no standard query language among NoSQL DBMSs. Each NoSQL DBMS supports a different data model, and no single language can suit all data models.

MongoDB has its own query language, known as MongoDB Query Language or MQL (we already saw CQL for Neo4j). I will walk you through the usage of MQL in the remainder of this lecture.

### Accessing databases and collections

Here is how we can access databases through different interfaces.

**Compass**:

It's just point and click. I'll demo this in class.

**`mongosh`**:
```js
show dbs
use my_db
```

**`pymongo`**:

```
my_db = client['my_db']
my_db
```

Running the above cell just gives you some information about our connection to the server. We'll learn how to run queries on this connection in a bit. For now, let's see what databases we have:

In [4]:
client.list_database_names()

['sample_airbnb',
 'sample_analytics',
 'sample_geospatial',
 'sample_guides',
 'sample_mflix',
 'sample_restaurants',
 'sample_supplies',
 'sample_training',
 'sample_weatherdata',
 'admin',
 'local']

To access collections within each database, use the following syntax:

**`mongosh`**:
```js
db.my_collection.method()
```

**`pymongo`**:

```
my_collection = my_db['my_collection']
my_collection
```

Again, this is just some information that we don't need. We will never use the database or collection objects on their own like this. For now, let's take a look at the collections inside the `sample_mflix` database:

In [5]:
client['sample_mflix'].list_collection_names()

['theaters', 'comments', 'sessions', 'movies', 'users']

Or alternatively:

In [6]:
client.sample_mflix.list_collection_names()

['theaters', 'comments', 'sessions', 'movies', 'users']

A very important thing to know before using MQL is that

> **Everything in MongoDB is a JSON-like document**

even queries themselves!
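
To make this concrete, here is a minimal sketch using only Python's standard library. It assumes nothing about the server: it just shows that a filter and a projection are ordinary dictionaries in `pymongo`, and that serializing them gives the JSON-like documents you would type in `mongosh`. The variable names are illustrative.

```python
import json

# A filter and a projection are plain documents; in pymongo they are dicts.
query = {'title': 'Titanic'}
projection = {'_id': 0, 'title': 1, 'year': 1}

# Serializing to JSON shows the document form that mongosh would also accept.
query_json = json.dumps(query)
projection_json = json.dumps(projection)

print(query_json)
print(projection_json)
```

Because queries are just data, they can be built, stored, and combined like any other dictionary before being passed to a method such as `.find()`.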

### `find`

The main method used for querying documents is the `.find()` method. Here is an example of a query in MongoDB:

**`mongosh`**:

```js
db.movies.find( {title: 'Titanic'} )
```

**`pymongo`**:

In [7]:
client['sample_mflix']['movies'].find( filter={'title': 'Titanic'} )

<pymongo.cursor.Cursor at 0x112d29450>

Using `filter=` is optional in the argument list, but remember the advice from the Zen of Python: "_explicit is better than implicit_".

Well, the above code doesn't show any documents yet, because it returns a cursor object, which behaves much like a Python generator (it is a lazy iterator). Let's return the first element of this generator:

In [8]:
next(client['sample_mflix']['movies'].find( {'title': 'Titanic'} ))

{'_id': ObjectId('573a1394f29313caabcdf639'),
 'plot': 'An unhappy married couple deal with their problems on board the ill-fated ship.',
 'genres': ['Drama', 'History', 'Romance'],
 'runtime': 98,
 'rated': 'NOT RATED',
 'cast': ['Clifton Webb',
  'Barbara Stanwyck',
  'Robert Wagner',
  'Audrey Dalton'],
 'num_mflix_comments': 0,
 'poster': 'https://m.media-amazon.com/images/M/MV5BMTU3NTUyMTc3Nl5BMl5BanBnXkFtZTgwOTA2MDE3MTE@._V1_SY1000_SX677_AL_.jpg',
 'title': 'Titanic',
 'fullplot': 'Unhappily married and uncomfortable with life among the British upper crust, Julia Sturges takes her two children and boards the Titanic for America. Her husband Richard also arranges passage on the doomed luxury liner in order to let him have custody of their two children. Their problems soon seem minor when the ship hits an iceberg.',
 'languages': ['English', 'Basque', 'French', 'Spanish'],
 'released': datetime.datetime(1953, 7, 13, 0, 0),
 'directors': ['Jean Negulesco'],
 'writers': ['Charles Bra

Or we can pass it to `list()` to materialize the generator entirely:

In [9]:
list(
    client['sample_mflix']['movies'].find( {'title': 'Titanic'} )
)

[{'_id': ObjectId('573a1394f29313caabcdf639'),
  'plot': 'An unhappy married couple deal with their problems on board the ill-fated ship.',
  'genres': ['Drama', 'History', 'Romance'],
  'runtime': 98,
  'rated': 'NOT RATED',
  'cast': ['Clifton Webb',
   'Barbara Stanwyck',
   'Robert Wagner',
   'Audrey Dalton'],
  'num_mflix_comments': 0,
  'poster': 'https://m.media-amazon.com/images/M/MV5BMTU3NTUyMTc3Nl5BMl5BanBnXkFtZTgwOTA2MDE3MTE@._V1_SY1000_SX677_AL_.jpg',
  'title': 'Titanic',
  'fullplot': 'Unhappily married and uncomfortable with life among the British upper crust, Julia Sturges takes her two children and boards the Titanic for America. Her husband Richard also arranges passage on the doomed luxury liner in order to let him have custody of their two children. Their problems soon seem minor when the ship hits an iceberg.',
  'languages': ['English', 'Basque', 'French', 'Spanish'],
  'released': datetime.datetime(1953, 7, 13, 0, 0),
  'directors': ['Jean Negulesco'],
  'writer

> **Note:** `.find( filter={} )` or `.find()` returns every document in the collection.
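
Since `.find()` with an empty filter can return a very large number of documents, it is often safer to pull only a few items from the cursor instead of calling `list()` on all of it. The sketch below uses `itertools.islice` for that. The helper name `take` and the stand-in iterator are assumptions made for illustration; a real `pymongo` cursor can be passed in the same way because it is also an iterator.

```python
from itertools import islice

def take(cursor, n):
    """Return at most n documents from a cursor (or any iterator)."""
    return list(islice(cursor, n))

# Stand-in for a pymongo cursor: any iterator of dicts behaves the same way here.
fake_cursor = iter([
    {'title': 'Titanic', 'year': 1953},
    {'title': 'Titanic', 'year': 1996},
    {'title': 'Titanic', 'year': 1997},
])

first_two = take(fake_cursor, 2)
remaining = list(fake_cursor)  # only the documents that were not consumed yet

print(first_two)
print(remaining)
```

Like a generator, the cursor is consumed as you read from it, so the second call only sees what is left.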

Note that there is another method `.findOne()` in `mongosh` and `.find_one()` in `pymongo`. This method returns only one document regardless of how many there are, according to the order in which documents are stored on the physical disk. It can be 

### `projection`

Remember what projection meant in SQL? Returning a particular set of columns among all that exist in a table was called projection (of the results onto particular columns).

Projection has a similar meaning in NoSQL: it means explicitly choosing the fields that we are interested in, instead of all fields that are returned by default. This is done by passing a document to the `projection=` argument that maps each field name to a truthy or falsy value, which indicates whether or not that field should be included.

For example, here I return only the `title` and `year` fields from each document in the result:

**`mongosh`**:
```js
db.movies.find( {title: 'Titanic'}, {'title': 1, 'year': 1} )
```

**`pymongo`**:

In [10]:
list(
    client['sample_mflix']['movies'].find(
        filter={'title': 'Titanic'},
        projection={'title': 1, 'year': 1}
    )
)

[{'_id': ObjectId('573a1394f29313caabcdf639'),
  'title': 'Titanic',
  'year': 1953},
 {'_id': ObjectId('573a139af29313caabcefb1d'),
  'title': 'Titanic',
  'year': 1996},
 {'_id': ObjectId('573a139af29313caabcf0d74'),
  'year': 1997,
  'title': 'Titanic'}]

> **Note:** In `pymongo`, you can use `True` instead of `1` and `False` instead of `0`.

> **Note:** In `pymongo`, we need to enclose all field names in single or double quotes (e.g. `'title'` not `title`), otherwise Python would complain because it doesn't recognize those names. In `mongosh`, this is not necessary.

In the documents returned above, note that the primary key field, namely the `_id` field, is always returned by default unless you explicitly exclude it using `{'_id': 0}` or `{'_id': False}`. **This is the only scenario where we might mix `1`s and `0`s (or `True`s and `False`s) in the projection document.**

In [11]:
list(
    client['sample_mflix']['movies'].find(
        filter={'title': 'Titanic'},
        projection={'_id': 0, 'title': 1, 'year': 1}
    )
)

[{'title': 'Titanic', 'year': 1953},
 {'title': 'Titanic', 'year': 1996},
 {'year': 1997, 'title': 'Titanic'}]
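
The inclusion rules described above can be summarized with a small pure-Python model. This is only a simplified sketch of the behaviour for top-level fields, not how MongoDB implements projection; the function name `project` and the sample document are made up for illustration.

```python
def project(doc, projection):
    """Simplified model of an inclusion projection on top-level fields."""
    # _id is included unless the projection explicitly turns it off.
    include_id = bool(projection.get('_id', True))
    wanted = {key for key, flag in projection.items() if flag and key != '_id'}

    result = {}
    if include_id and '_id' in doc:
        result['_id'] = doc['_id']
    for key, value in doc.items():
        if key in wanted:
            result[key] = value
    return result

# Hypothetical document with a shortened _id for readability.
movie = {'_id': 'abc123', 'title': 'Titanic', 'year': 1997, 'runtime': 194}

with_id = project(movie, {'title': 1, 'year': 1})
without_id = project(movie, {'_id': 0, 'title': 1, 'year': 1})

print(with_id)
print(without_id)
```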

### `sort`

**`mongosh`**:
```js
db.movies.find(<filter>, <projection>).sort( {runtime: 1, year:-1} )
```

**`pymongo`**:

In [12]:
list(
    client['sample_mflix']['movies'].find(
        filter={'title': 'Titanic'},
        projection={'_id': 0, 'title': 1, 'year': 1, 'runtime': 1},
        sort=[('runtime', 1), ('year', -1)]
    )
)

[{'runtime': 98, 'title': 'Titanic', 'year': 1953},
 {'runtime': 173, 'title': 'Titanic', 'year': 1996},
 {'year': 1997, 'title': 'Titanic', 'runtime': 194}]

### `limit`

**`mongosh`**:
```js
db.movies.find({}, {title: 1, _id: 0}).limit(5)
```

**`pymongo`**:

In [13]:
list(
    client['sample_mflix']['movies'].find(
        projection={'title': 1, '_id': 0},
        limit=5
    )
)

[{'title': 'Blacksmith Scene'},
 {'title': 'The Great Train Robbery'},
 {'title': 'The Land Beyond the Sunset'},
 {'title': 'A Corner in Wheat'},
 {'title': 'Winsor McCay, the Famous Cartoonist of the N.Y. Herald and His Moving Comics'}]

### `count` and `count_documents`

**`mongosh`**:
```js
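// cursor.count() still appears in older examples, but it is deprecated in recent versions
db.movies.find({year: 2000}).count()
// countDocuments accepts the same filter document; with no filter it counts every document
db.movies.countDocuments({year: 2000})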
```

**`pymongo`**:

In [14]:
client['sample_mflix']['movies'].count_documents(filter={'year': 2000})

618

### `skip`

**`mongosh`**:
```js
db.movies.find( {title: 'Titanic'}, {title: 1, year: 1} ).skip(2)
```

**`pymongo`**:

In [15]:
list(
    client['sample_mflix']['movies'].find(
        filter={'title': 'Titanic'},
        projection={'title': 1, 'year': 1},
        skip=2
    )
)

[{'_id': ObjectId('573a139af29313caabcf0d74'),
  'year': 1997,
  'title': 'Titanic'}]
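
The `sort`, `skip`, and `limit` arguments are commonly combined to page through results. The following is a minimal sketch under a few assumptions: `fetch_page` is a hypothetical helper, `collection` is any `pymongo` collection object (such as `client['sample_mflix']['movies']`), and the sort keys are chosen only to give a stable order, since the order of results is not guaranteed without a sort.

```python
def fetch_page(collection, page, page_size=5):
    """Return one page of documents, ordered by title and then _id."""
    cursor = collection.find(
        filter={},
        projection={'_id': 0, 'title': 1, 'year': 1},
        sort=[('title', 1), ('_id', 1)],  # stable order across pages
        skip=page * page_size,            # documents to jump over
        limit=page_size,                  # documents to return
    )
    return list(cursor)

# Hypothetical usage against the sample data set:
# third_page = fetch_page(client['sample_mflix']['movies'], page=2)
```

Note that `skip` still makes the server walk over the skipped documents, so this pattern is best suited to small offsets.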

### `distinct`

**`mongosh`**:
```js
db.movies.distinct( 'title', {title: 'Titanic'} )
```

**`pymongo`**:

In [16]:
list(
    client['sample_mflix']['movies'].find(
        filter={'title': 'Titanic'},
    ).distinct('title')
)

['Titanic']

The `distinct` method here only returns unique **values**, not entire documents. In order to return documents that have unique values in certain fields, we need to use grouping through aggregation pipelines, which we'll learn about in the next lecture.
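
In `pymongo`, `distinct` can also be called directly on the collection with a field name and a filter, which mirrors the `mongosh` form shown earlier. The wrapper below is a small sketch; `distinct_values` is a hypothetical helper name, and `collection` is assumed to be a `pymongo` collection object.

```python
def distinct_values(collection, field, query=None):
    """Return the unique values of `field` among documents matching `query`."""
    # An empty filter means "consider every document in the collection".
    return collection.distinct(field, query or {})

# Hypothetical usage against the sample data set:
# distinct_values(client['sample_mflix']['movies'], 'title', {'title': 'Titanic'})
```

The result is a plain Python list of values, so no `list()` call is needed around it.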

## `$` and operators in MongoDB

In MongoDB, operators are denoted with a dollar sign `$`. Examples include the `$sum` operator, which is used in aggregation pipelines, and comparison operators such as `$gte`, which is equivalent to `>=` in SQL or Python.

The dollar sign `$` also has another use case: if the contents of a field need to appear as a value, we need to reference that field by prefixing its name with a `$`. For example, I'll show you later in this lecture that if you want to rename a field in the output, you can have something like this in the projection stage of a pipeline:

```
{'$project': {'duration': '$runtime'}}
```

Had I used `{'duration': 'runtime'}`, the `runtime` would have been interpreted as a literal string, not the actual values in the `runtime` field. You'll also see this used in the `_id` field of a `$group` stage for a similar reason.
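
The difference between a field reference and a literal string can be modelled in a few lines of plain Python. This is a deliberately simplified sketch that only handles top-level fields; `resolve` is a hypothetical helper, not part of `pymongo` or MongoDB.

```python
def resolve(expr, doc):
    """Simplified model: '$name' reads a top-level field, anything else is a literal."""
    if isinstance(expr, str) and expr.startswith('$'):
        return doc.get(expr[1:])
    return expr

movie = {'title': 'Titanic', 'runtime': 194}

renamed = {'duration': resolve('$runtime', movie)}  # value taken from the field
literal = {'duration': resolve('runtime', movie)}   # the string itself

print(renamed)
print(literal)
```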

## Comparison operators
### `$gt`, `$gte`, `$lt`, `$lte`

These operators have the same meaning as `>`, `>=`, `<`, and `<=` in SQL or Python. For example, while a filter `{'runtime': 200}` would return documents whose runtime is exactly 200 minutes, `{'runtime': {'$gte': 200}}` would return documents whose runtime is greater than or equal to 200 minutes.

**Example:** Return the title, runtime, and production year of 5 movies with a runtime of 200 minutes or greater.

In [17]:
list(
    client['sample_mflix']['movies'].find(
        filter={'runtime': {'$gte': 200}},
        projection={'_id': 0, 'title': 1, 'runtime': 1, 'year': 1},
        limit=5
    )
)

[{'runtime': 399, 'title': 'Les vampires', 'year': 1915},
 {'runtime': 240, 'title': 'Napoleon', 'year': 1927},
 {'runtime': 281, 'title': 'Les Misèrables', 'year': 1934},
 {'runtime': 245, 'title': 'Flash Gordon', 'year': 1936},
 {'runtime': 238, 'title': 'Gone with the Wind', 'year': 1939}]

**Example:** How many movies are there with a runtime of 200 minutes or greater?

In [18]:
client['sample_mflix']['movies'].count_documents(filter={'runtime': {'$gte': 200}})

227
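
To see how these operators treat the boundary value, here is a small pure-Python model of a single-field comparison filter. It is a sketch of the semantics for scalar fields only, not MongoDB's implementation; `matches` and the 200-minute document are made up for illustration, while the other runtimes come from the outputs above.

```python
import operator

# Map each MQL comparison operator to its Python counterpart.
COMPARATORS = {
    '$gt': operator.gt,
    '$gte': operator.ge,
    '$lt': operator.lt,
    '$lte': operator.le,
}

def matches(doc, field, condition):
    """Check one field against a filter value such as 200 or {'$gte': 200}."""
    if field not in doc:
        return False
    value = doc[field]
    if not isinstance(condition, dict):
        return value == condition  # plain value means equality
    # Several operators in one document must all hold (a range query).
    return all(COMPARATORS[op](value, bound) for op, bound in condition.items())

movies = [
    {'title': 'Titanic', 'runtime': 194},
    {'title': 'Napoleon', 'runtime': 240},
    {'title': 'Les vampires', 'runtime': 399},
    {'title': 'Hypothetical film', 'runtime': 200},  # made-up boundary case
]

def titles(condition):
    return [m['title'] for m in movies if matches(m, 'runtime', condition)]

exactly_200 = titles(200)
at_least_200 = titles({'$gte': 200})
over_200 = titles({'$gt': 200})
between = titles({'$gte': 200, '$lt': 300})
```

The same range filter, `{'runtime': {'$gte': 200, '$lt': 300}}`, can be passed unchanged as the `filter=` argument of `.find()`.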

### `$ne`

The `$ne` (not equal) operator has the same meaning as `<>` in SQL or `!=` in Python.

**Example:** Find the title and the type of 5 documents in the `movies` collection that are not of type `movie`.

In [19]:
list(
    client['sample_mflix']['movies'].find(
        filter={'type': {'$ne': 'movie'}},
        projection={'_id': 0, 'title': 1, 'type': 1},
        limit=5
    )
)

[{'title': 'The Forsyte Saga', 'type': 'series'},
 {'title': 'Scenes from a Marriage', 'type': 'series'},
 {'title': 'Ironiya sudby, ili S legkim parom!', 'type': 'series'},
 {'title': 'I, Claudius', 'type': 'series'},
 {'title': 'Sybil', 'type': 'series'}]

Note that the `type` field of none of the returned documents is `movie` (all of them are `series` because that's the only other type that exists in the documents of the `movies` collection).

### `$in`, `$nin`

These two operators are equivalent to `IN` and `NOT IN` in SQL, or `in` and `not in` in Python. They are used to check if the value of a field is equal (or not equal) to any value in a given list. What these operators do can also be imitated with `$or` or `$nor`, but these are more concise.

**Example:** Return the title, production year, and the cast of these movies: The Sixth Sense, Imitation Game, The Red Violin, Match Point, Forrest Gump.

In [20]:
list(
    client["sample_mflix"]["movies"].find(
        filter={
            "title": {
                "$in": [
                    "The Sixth Sense",
                    "Imitation Game",
                    "The Red Violin",
                    "Match Point",
                    "Forrest Gump",
                ]
            }
        },
        projection={"_id": 0, "title": 1, "cast": 1, "year": 1},
        limit=5,
    )
)

[{'year': 1994,
  'title': 'Forrest Gump',
  'cast': ['Tom Hanks',
   'Rebecca Williams',
   'Sally Field',
   'Michael Conner Humphreys']},
 {'cast': ['Carlo Cecchi',
   'Irene Grazioli',
   'Anita Laurenzi',
   'Tommaso Puntelli'],
  'title': 'The Red Violin',
  'year': 1998},
 {'year': 1999,
  'title': 'The Sixth Sense',
  'cast': ['Bruce Willis',
   'Haley Joel Osment',
   'Toni Collette',
   'Olivia Williams']},
 {'year': 2005,
  'title': 'Match Point',
  'cast': ['Jonathan Rhys Meyers',
   'Alexander Armstrong',
   'Paul Kaye',
   'Matthew Goode']}]

Note that only four documents come back even though five titles were listed: `$in` compares exact values, and no document in the collection has the exact title `Imitation Game`.

**Example:** Find the number of movies that are not available in any of these languages: English, French, Italian or German.

In [21]:
client["sample_mflix"]["movies"].count_documents(
    filter={'languages': {'$nin': ['English', 'French', 'German', 'Italian']}}
)

4888
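
When the field holds an array, as `languages` does, `$in` matches if at least one element is in the given list, and `$nin` matches only if none is (including documents that lack the field altogether). The sketch below models that behaviour in plain Python. It is a simplification, and the helper names and sample documents are hypothetical.

```python
def field_in(doc, field, candidates):
    """Simplified model of $in: true if any value of the field is in candidates."""
    value = doc.get(field)
    values = value if isinstance(value, list) else [value]
    return any(v in candidates for v in values)

def field_nin(doc, field, candidates):
    """Simplified model of $nin: the negation of $in, so missing fields match."""
    return not field_in(doc, field, candidates)

# Made-up documents for illustration.
docs = [
    {'title': 'A', 'languages': ['English', 'Spanish']},
    {'title': 'B', 'languages': ['Japanese']},
    {'title': 'C'},  # no languages field at all
]

wanted = ['English', 'French', 'German', 'Italian']
available = [d['title'] for d in docs if field_in(d, 'languages', wanted)]
not_available = [d['title'] for d in docs if field_nin(d, 'languages', wanted)]

print(available)
print(not_available)
```

This means the count above also includes movies that have no `languages` field, which may or may not be what you want.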

## Can you?

- Use MQL to interact with MongoDB?

## Class activity

- Practice MQL.