### 1. What is MongoDB?



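MongoDB is a document database developed by MongoDB, Inc. It is source-available rather than strictly open source: since 2018 the server has been licensed under the Server Side Public License (SSPL), which is not an OSI-approved license. A managed, cloud-native version (database-as-a-service, DBaaS) is offered as MongoDB Atlas.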

- It is a NoSQL (that is, non-relational) database system.

- In detail: it is a document-oriented database (it stores documents instead of rows).

- In more detail: MongoDB databases store collections of JSON-like documents.

- In even more detail: they store collections of BSON documents (BSON is a binary representation of JSON that adds extra data types, such as dates).
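To make the JSON vs. BSON point concrete, here is a minimal sketch using only Python's standard library. The movie document and its field names are made up for illustration. It shows that a document is just nested key-value data, and that plain JSON has no date type, which is one of the gaps BSON fills.

```python
import json
from datetime import datetime, timezone

# A hypothetical movie document, written as a Python dict.
movie = {
    "title": "The Shawshank Redemption",
    "genres": ["Drama"],              # arrays are allowed
    "details": {"language": "en"},    # so are nested (embedded) documents
}

# Plain JSON covers strings, numbers, booleans, null, arrays, and objects.
as_json = json.dumps(movie)
assert json.loads(as_json) == movie

# JSON has no native date type, so the standard encoder rejects a datetime.
movie["released"] = datetime(1994, 10, 14, tzinfo=timezone.utc)
try:
    json.dumps(movie)
    json_accepts_dates = True
except TypeError:
    json_accepts_dates = False

print(json_accepts_dates)  # False
```

BSON defines a date type (along with others such as ObjectId and binary data), so a driver can store the `released` value directly.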


### 2. How to get and use MongoDB



You can get two self-managed editions of MongoDB:

- Community Edition: free version with a reduced feature set. You install and run it yourself, locally or on a cloud provider such as AWS, Azure, or GCP.
- Enterprise Edition: commercial version, with extra security and performance features.

https://www.mongodb.com/try


MongoDB provides several libraries, SDKs, and apps for different use cases. 



For basic data analysis and software development work, here are some tools:

- MongoDB Atlas: managed cloud service whose web interface supports querying, exploring, visualizing, and designing data and databases.

- MongoDB Compass: standalone desktop application with similar querying and exploration features.

- MongoDB Shell (`mongosh`): CLI tool.

- MongoDB drivers: libraries to use in app code to connect to MongoDB. In Python, the simple option is `pymongo`.

### 3. Practical Examples of Using MQL (MongoDB Query Language)



#### 3.1 CRUD

In SQL we can do this:


```sql
INSERT INTO movies (title, genre, release_date)
VALUES ('The Shawshank Redemption', 'Drama', '1994-10-14');
```

```sql
UPDATE movies
SET title = 'The Shawshank Redemption', genre = 'Crime', release_date = '1994-10-14'
WHERE id = 1;
```

```sql
DELETE FROM movies
WHERE id = 1;
```



In MQL, the corresponding operations (shown here on a different example document) could look like this:

```mql
db.movies.insertOne({"nombre": "papagayo"})
```
```mql
db.movies.find({"nombre": "papagayo"})
```
```mql
db.movies.updateOne({"nombre": "papagayo"}, {$set: {"year": 1920}})
```
```mql
db.movies.updateMany({"nombre": "papagayo"}, {$set: {"year": 1920}})
```
```mql
db.movies.deleteMany({"nombre": "papagayo"})
```
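From Python, the same operations map onto `pymongo` collection methods (`insert_one`, `find`, `update_one`, `update_many`, `delete_many`). The sketch below wraps them in a hypothetical helper, `crud_demo`, that takes a collection object as its argument, so the block itself needs no running server. The connection URI and database name in the usage comment are placeholders.

```python
def crud_demo(movies):
    """Run the CRUD calls shown above against a pymongo-style collection."""
    query = {"nombre": "papagayo"}
    # In Python, operators such as $set must be quoted strings.
    inserted = movies.insert_one({"nombre": "papagayo"})         # Create
    found = list(movies.find(query))                             # Read
    one = movies.update_one(query, {"$set": {"year": 1920}})     # Update first match
    many = movies.update_many(query, {"$set": {"year": 1920}})   # Update all matches
    deleted = movies.delete_many(query)                          # Delete all matches
    return {
        "inserted_id": inserted.inserted_id,
        "found": found,
        "modified": one.modified_count + many.modified_count,
        "deleted": deleted.deleted_count,
    }

# Usage, assuming pymongo is installed and a server is reachable
# (the URI and database name below are placeholders):
#   from pymongo import MongoClient
#   movies = MongoClient("mongodb://localhost:27017")["sample_db"]["movies"]
#   print(crud_demo(movies))
```

Note that `find` returns a cursor, which is why the sketch wraps it in `list(...)` to materialize the results.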


#### 3.2 Projection and Sorting

In SQL we do this:

```sql
SELECT start_station_id, end_station_id, tripduration
FROM trips
ORDER BY tripduration DESC;
```


In MQL, we can do this:

```mql
db.trips.aggregate([
    {$project: {"start station id": 1, "end station id":1, "tripduration": 1}}, 
    {$sort: {"tripduration":-1}}
])
```
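To see what the two stages do, here is a plain-Python sketch on a few made-up trip documents. It imitates the pipeline rather than calling MongoDB; the sample values and the `project` helper are assumptions for illustration.

```python
# Made-up trip documents; field names follow the pipeline above.
trips = [
    {"_id": 1, "start station id": 10, "end station id": 20, "tripduration": 300, "bikeid": 7},
    {"_id": 2, "start station id": 11, "end station id": 21, "tripduration": 900, "bikeid": 8},
    {"_id": 3, "start station id": 10, "end station id": 22, "tripduration": 600, "bikeid": 9},
]

KEEP = ("_id", "start station id", "end station id", "tripduration")

def project(doc, fields=KEEP):
    # $project with 1s keeps only the listed fields; _id is kept by default.
    return {k: doc[k] for k in fields if k in doc}

# $sort with -1 means descending order.
result = sorted((project(d) for d in trips), key=lambda d: d["tripduration"], reverse=True)

print([d["tripduration"] for d in result])  # [900, 600, 300]
```

The `bikeid` field is dropped by the projection, just as any unlisted field would be in the MQL version.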


#### 3.3 Aggregation

In SQL we do this:

```sql
SELECT start_station_id, end_station_id, AVG(tripduration) AS avg_duration, COUNT(*) AS num_trips
FROM trips
GROUP BY start_station_id, end_station_id;
```



In MQL, we can do this:

```mql
db.trips.aggregate([
	{$project: {"start station id": 1, "end station id":1, "tripduration": 1}}, 
	{$group: 
		{_id: {"start station id": "$start station id", "end station id": "$end station id"}, 
		"avg_duration": {$avg: "$tripduration"}, 
		"num_trips": {$sum: 1}
		}
	}
])
```
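The `$group` stage above can be imitated in plain Python to show the shape of its output. This is a sketch on made-up data, not a MongoDB call: the compound `_id` plays the role of the SQL `GROUP BY` key, `$avg` becomes a mean, and `{$sum: 1}` becomes a count.

```python
from collections import defaultdict

# Made-up trip documents.
trips = [
    {"start station id": 10, "end station id": 20, "tripduration": 300},
    {"start station id": 10, "end station id": 20, "tripduration": 500},
    {"start station id": 11, "end station id": 21, "tripduration": 900},
]

# Collect durations per (start, end) pair, i.e. per group key.
durations = defaultdict(list)
for t in trips:
    key = (t["start station id"], t["end station id"])
    durations[key].append(t["tripduration"])

grouped = [
    {
        "_id": {"start station id": start, "end station id": end},
        "avg_duration": sum(values) / len(values),  # $avg
        "num_trips": len(values),                   # {$sum: 1}
    }
    for (start, end), values in durations.items()
]

print(grouped[0]["avg_duration"], grouped[0]["num_trips"])  # 400.0 2
```

In MongoDB itself, the order of documents coming out of `$group` is not guaranteed, so add a `$sort` stage if order matters.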

#### 3.4 Joins

In SQL we do this:

```sql
SELECT m.*
FROM movies AS m
LEFT JOIN comments AS c ON m._id = c.movie_id
WHERE m.num_mflix_comments > 0
LIMIT 10;
```



In MQL, we can do this:

```mql
db.movies.aggregate([
	{$match: {"num_mflix_comments":{$gt: 0}}}, 
	{$lookup: {"from": "comments", "localField": "_id", "foreignField": "movie_id", "as": "comments"}}, 
	{$limit: 10}
])
```
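One difference from the SQL version is worth spelling out: `$lookup` does not produce one row per movie-comment pair. It keeps one document per movie and attaches the matching comments as an array (empty if there are none). A plain-Python sketch on made-up data, imitating the three stages:

```python
# Made-up documents for illustration.
movies = [
    {"_id": "m1", "title": "A", "num_mflix_comments": 2},
    {"_id": "m2", "title": "B", "num_mflix_comments": 0},
    {"_id": "m3", "title": "C", "num_mflix_comments": 1},
]
comments = [
    {"_id": "c1", "movie_id": "m1", "text": "great"},
    {"_id": "c2", "movie_id": "m1", "text": "ok"},
]

# $match: keep movies with at least one comment according to the counter field.
matched = [m for m in movies if m["num_mflix_comments"] > 0]

# $lookup: attach all comments whose movie_id equals the movie's _id.
joined = [
    {**m, "comments": [c for c in comments if c["movie_id"] == m["_id"]]}
    for m in matched
]

# $limit: keep at most 10 documents.
result = joined[:10]

print([(m["title"], len(m["comments"])) for m in result])  # [('A', 2), ('C', 0)]
```

Movie "C" still appears, with an empty `comments` array, which is the left-outer-join behavior.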

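MQL also has stages that reshape documents, which covers some cases where SQL would need separate tables and joins.

This turns a collection with one document per student (each holding a `scores` array) into one document per assignment score: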

```mql
db.grades.aggregate([{$unwind: "$scores"}])
```
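As a rough picture of what `$unwind` does, here is a plain-Python sketch on made-up `grades` documents: each element of the `scores` array becomes its own document, with the other fields copied along.

```python
# Made-up grades documents, one per student and class.
grades = [
    {"student_id": 1, "class_id": 100,
     "scores": [{"type": "exam", "score": 80}, {"type": "quiz", "score": 90}]},
    {"student_id": 2, "class_id": 100,
     "scores": [{"type": "exam", "score": 70}]},
]

# One output document per array element; "scores" now holds a single element.
unwound = [
    {**g, "scores": s}
    for g in grades
    for s in g.get("scores", [])
]

print(len(unwound))  # 3
```

As in this sketch, `$unwind` by default drops documents whose array is missing or empty.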

This groups the student documents by class, producing one document per class with its students embedded (the `$limit` stage returns just one of them):

```mql
db.grades.aggregate([
    {$group: {"_id": "$class_id", "students": {$push: "$$ROOT"}}},
    {$limit: 1}
])
```
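In plain Python, the same regrouping could be sketched as follows (made-up data again). `$$ROOT` refers to the whole input document, so each class ends up holding complete copies of its student documents.

```python
# Made-up grades documents, one per student and class.
grades = [
    {"student_id": 1, "class_id": 100},
    {"student_id": 2, "class_id": 200},
    {"student_id": 3, "class_id": 100},
]

# $group with $push: "$$ROOT" collects the full documents per class_id.
by_class = {}
for g in grades:
    by_class.setdefault(g["class_id"], []).append(g)

classes = [{"_id": class_id, "students": docs} for class_id, docs in by_class.items()]

# $limit: 1 keeps a single class document.
result = classes[:1]

print([s["student_id"] for s in result[0]["students"]])  # [1, 3]
```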


### 4. Data Modeling Philosophy Behind MongoDB



**Performance and scalability are preferred over storage efficiency and perfect data integrity.**



- Flexible document structure instead of rigid schema constraints. 
  - If a document doesn't have a field, that's fine. No need for NULLs.



- "Data that is read together should be stored together". 
  - This means that we default to denormalized data. **Embedded documents and arrays are preferred to joining tables.**
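The two ideas above can be illustrated with plain Python data structures. This is a sketch with made-up data and a hypothetical helper, `comments_for`: the normalized layout needs a join-like lookup to assemble a movie with its comments, while the document layout reads everything in one go and simply omits fields it does not have.

```python
# Normalized (relational style): two "tables" linked by movie_id.
movies = [{"id": 1, "title": "Example Movie"}]
comments = [
    {"id": 1, "movie_id": 1, "text": "Loved it"},
    {"id": 2, "movie_id": 1, "text": "Too long"},
]

def comments_for(movie_id):
    # The join: scan the comments table for matching rows.
    return [c["text"] for c in comments if c["movie_id"] == movie_id]

# Document style: comments are embedded in the movie they belong to.
movie_doc = {
    "_id": 1,
    "title": "Example Movie",
    "comments": [{"text": "Loved it"}, {"text": "Too long"}],
    # No "tagline" field at all: absent fields are omitted, not stored as NULL.
}

embedded = [c["text"] for c in movie_doc["comments"]]

print(embedded == comments_for(1))  # True
print("tagline" in movie_doc)       # False
```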



#### Why?

Relational databases are very good for (at least) two things:
- Storage efficiency (as duplicated data is avoided)
- Data consistency (as duplicate data is avoided and strict constraints are enforced)
  


But, because of this, they are bad for some other things:
- Performance (especially if joins are necessary)
- Flexibility and scalability (because schema constraints are strict for ALL ROWS in a table, and therefore any desired change implies a full migration)



**In this tradeoff, MongoDB takes the opposite approach of relational databases.**

This is popular today because storage is cheap, but developer time and bad user experience are expensive!

### 5. Bonus: How does MongoDB feel for Data Science work?



I'm not sure.



Is MQL easier than SQL? I don't know. It seems like a matter of taste.



Does the lack of data normalization make it easier or harder to understand data?


- On one hand, in relational databases, you are stuck with having to do many joins in EDA, and you often end up with ugly things like columns with 99% nulls.


- On the other hand, in MongoDB, you have few guarantees about what your data will look like. This sounds easy to mismanage, which could result in incomprehensible datasets full of atypical stuff.



What's your feeling? What's your experience?