# FIT3182 - Big data management and processing

# Activity: MongoDB with Python


Python is an easy-to-code, scalable and powerful programming language. Python can help you to develop a MongoDB application rapidly. This activity will help you to write simple, clear and powerful code that works with MongoDB.

In particular, we will use `PyMongo`, which provides an interface for accessing MongoDB from Python. As we learned in previous weeks, MongoDB stores BSON documents. The `PyMongo` syntax is close to the syntax of commands in the `mongo` shell, so the learning curve for this activity should be gentle.

**In this activity, we will perform the following tasks:**
- Introduction to working with MongoDB and Python
- Practical example working with MongoDB and Python
- For more information, you can refer to: 
    * https://docs.mongodb.com/getting-started/python/client/
    * http://api.mongodb.com/python/current/tutorial.html

Let's get started!

## MongoDB Connection ##

#### Prerequisite
First, let's make sure that we have the package, `PyMongo`, installed. Open a jupyter notebook, and run the following code:
```
import pymongo
```

If there is no exception, the package has been installed. Otherwise, you need to install `PyMongo`; in that case, your tutor will give you instructions for installing it. Also, make sure that the following are running: the MongoDB server (i.e. `mongod`) and the MongoDB shell (`mongo`).
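
A quick way to check for the package without triggering an exception is to ask Python whether it can locate the module. The sketch below uses only the standard library; the helper name `is_installed` is our own and is not part of PyMongo.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None

if is_installed("pymongo"):
    print("PyMongo is available")
else:
    print("PyMongo is missing; install it with: pip install pymongo")
```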

#### Making a Connection with MongoDB
<span style="color:red">Firstly, start your mongodb Docker container from the last lab.</span>  
Then, we can use one of the following methods to make such a connection:

In [3]:
!pip install pymongo
# Run this whenever PyMongo is not installed yet (add --upgrade to get the latest version)



In [4]:
import pymongo
from pymongo import MongoClient

# Method 1: specify the host and port explicitly
client = MongoClient('your host ip address', 27017) 

# Method 2: use the MongoDB URI format
#client = MongoClient('mongodb://your host IP address:27017/') 
client = MongoClient('mongodb://10.192.45.36:27017/')
client

MongoClient(host=['10.192.45.36:27017'], document_class=dict, tz_aware=False, connect=True)
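
Creating a `MongoClient` does not by itself prove that the server is running, because PyMongo connects lazily. One way to verify the connection is to send the server a cheap `ping` command. The helper below is a minimal sketch: `is_server_reachable` is a name we made up, and it catches any exception for simplicity, whereas real code would more likely catch `pymongo.errors.ConnectionFailure`.

```python
def is_server_reachable(client):
    """Return True if the server behind `client` answers a ping command."""
    try:
        # "ping" is a cheap server command; it raises if no server responds
        client.admin.command("ping")
        return True
    except Exception:  # in real code, catch pymongo.errors.ConnectionFailure
        return False
```

With the `client` created above, `is_server_reachable(client)` should return `True` when the Docker container is up. With default settings, a failing check may take around 30 seconds before it gives up.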

## Getting a Database and Collection

#### Getting a Database 
A single MongoDB instance, accessed here through the variable `client`, can manage multiple databases. To access a database, use the following syntax:

In [7]:
result = client.list_database_names()
result

db = client.fit3182_db # assume that we use the database fit3182_db that we created in the previous lab.
db
#db = client['fit3182_db'] # another way of getting a database

Database(MongoClient(host=['10.192.45.36:27017'], document_class=dict, tz_aware=False, connect=True), 'fit3182_db')

#### Getting a Collection
Now we can access a collection via the following way:

In [8]:
print(db.list_collection_names())

collection = db.FIT_COMPLEX # FIT_COMPLEX: we created and played with this collection in the previous lab
#collection = db['FIT_COMPLEX'] # another way of getting a collection

['FIT_COMPLEX', 'FIT']


**NOTE**: If a database or collection does not exist yet, it is created when the first document is inserted into it. Do you remember how we created a database and a collection in the `mongo` shell? The same rule applies there.
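
To observe this lazy creation, you can check whether a collection name is already stored in the database before and after the first insert. This is a small sketch; `collection_exists` is a hypothetical helper name, and `db` is assumed to be a PyMongo database object like the one obtained above.

```python
def collection_exists(db, name):
    """Return True if `name` is already a stored collection in `db`."""
    return name in db.list_collection_names()
```

For a brand-new collection name, `collection_exists(db, name)` is expected to be `False` right after `db[name]` is accessed, and `True` once a document has been inserted.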

## CRUD Operations using Python
We learned that data in MongoDB is represented (and stored) using JSON-style documents. In Python, we use built-in "dictionaries" to represent documents in MongoDB. 

For this activity, we will create another collection: `montours` (Monash Tour) within the database `fit3182_db`.

### Create

Let's create and get a collection:

In [9]:
montours = db.montours

Let's now create a new document:

In [10]:
newTour = {"package":"MonITTour",
    "name":"Monash IT Tour",
    "length":1,
    "price":100,
    "location":"Caulfield",
    "organiser" : {
        "faculty": "FIT",
        "person" : "John Smith"
    },
    "tags":["Monash", "FIT", "Caulfield"]}

### Insert
To insert a document, we can use the `insert_one()` method. For example:

In [11]:
result = montours.insert_one(newTour)

The operation returns an `InsertOneResult` object, which includes an attribute `inserted_id` that contains the `_id` of the inserted document. Let's print the `_id`.

In [12]:
print(result.inserted_id)

66272ae1bb08c8426703fbf6


Let's print the documents in `montours` in a more readable way. Remember that we used `pretty()` in the `mongo` shell to print documents in a more readable form. Similarly, we can use `pprint` in Python:

In [13]:
from pprint import pprint #"Pretty Printer"

cursor = montours.find({})
for document in cursor: 
    pprint(document)

{'_id': ObjectId('66272ae1bb08c8426703fbf6'),
 'length': 1,
 'location': 'Caulfield',
 'name': 'Monash IT Tour',
 'organiser': {'faculty': 'FIT', 'person': 'John Smith'},
 'package': 'MonITTour',
 'price': 100,
 'tags': ['Monash', 'FIT', 'Caulfield']}


Note that an `_id` field has been added automatically; its value must be unique across the collection.

Let’s insert multiple documents at one time. We can perform `insert_many()`:

In [14]:
newTours = [
{
    "package":"MonArtTour",
    "name":"Monash Art Tour",
    "length":2,
    "price":50,
    "location":"Caulfield",
    "organiser" : {
            "faculty": "Faculty of Arts",
            "person" : "Linda Adams"
    },    
    "tags":["Monash", "Art", "Caulfield"]
},
{
    "package":"MonITTour",
    "name":"Monash IT Tour at Clayton",
    "length":3,
    "price":50,
    "location":"Clayton",
    "organiser" : {
            "faculty": "FIT",
            "person" : "Josh Gange"
    },
    "tags":["Monash", "FIT", "Clayton"]
}]
result = montours.insert_many(newTours)

The above operation returns an `InsertManyResult` object, which includes an attribute `inserted_ids` that contains the list of ids of the inserted documents.

<font color='blue'>
**Exercise**: Let's print the ids.
</font><br>

**Solution and Expected Output**: 

In [15]:
result.inserted_ids

[ObjectId('66272c67bb08c8426703fbf7'), ObjectId('66272c67bb08c8426703fbf8')]

<font color='blue'>
**Exercise**: Let's print all documents in the montours collection.
</font><br>

### Update
Now let's focus on how we can update documents in MongoDB. We can use `update_one()` or `update_many()` to update documents in a collection. The `update_one()` method updates a single document, while `update_many()` updates all documents that match the criteria. Note that the `_id` field cannot be updated.

The following command updates the first document whose `package` is equal to `MonITTour`. In the command, we use `$set` to update the `name` field. After running the command, let's print the documents that match the condition.

In [16]:
result = montours.update_one(
    {"package": "MonITTour"},                     # filter
    {"$set": {"name": "Monash IT Faculty Tour"}}  # update
)

# Or, on a single line:
# result = montours.update_one({"package":"MonITTour"},{"$set":{"name":"Monash IT Faculty Tour"}})

Let's check how many documents were matched and how many were modified using the following:

In [17]:
result.matched_count

1

In [18]:
result.raw_result

{'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True}

In [19]:
result.modified_count

1

#### Upsert in MongoDB 
Upsert = Update + Insert
- https://www.geeksforgeeks.org/upsert-in-mongodb/
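
In PyMongo, an upsert is requested by passing `upsert=True` to `update_one()` (or `update_many()`). The following is a minimal sketch, not part of the original lab: `upsert_tour` is a hypothetical helper, and `collection` is assumed to be a PyMongo collection such as `montours`.

```python
def upsert_tour(collection, package, fields):
    """Update the first tour with this package, or insert one if none matches."""
    result = collection.update_one(
        {"package": package},   # filter: which document to look for
        {"$set": fields},       # update: fields to set on the match
        upsert=True,            # insert a new document if nothing matches
    )
    # upserted_id is None when an existing document was updated
    return result.upserted_id
```

For example, `upsert_tour(montours, "MonSciTour", {"name": "Monash Science Tour", "price": 80})` (made-up values) would insert a new document the first time it runs and update that document on later runs.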

#### Update embedded documents
We can also update embedded documents. To update a field within an embedded document, we can use the "dot" notation. For example, if you want to update `person` in the embedded `organiser` document, you can update this field through: `organiser.person`.

<font color='blue'>
**Exercise**: Let's update the field 'person' of the tour whose 'package' is equal to 'MonArtTour' to 'Katherine McDonald'. Check the result.
</font><br>

**Solution and Expected Output**: 
```
result = montours.update_one( \
    {"package":"MonArtTour"}, \
    { \
    "$set": {\
        "organiser.person":"Katherine McDonald"} \
    })
result.matched_count
```

We can also update multiple documents using the `update_many()` method. 

<font color='blue'>
**Exercise**: Let's update the field 'name' of all tours whose 'package' is equal to 'MonITTour' to 'Exciting journey to Monash FIT'. Check the result.
</font><br>

**Solution and Expected Output**: 
```
result = montours.update_many( \
    {"package":"MonITTour"}, \
    { \
    "$set": {\
        "name":"Exciting journey to Monash FIT"} \
    })
```

In [20]:
result = montours.update_one( \
    {"package":"MonArtTour"}, \
    { \
    "$set": {\
        "organiser.person":"Katherine McDonald",
        "organiser.faculty":"AIYOYO"} \
    })
result.matched_count

1

#### Replace a Document
Let's now replace an entire document, except for its `_id` field, using the `replace_one()` method. For example, to replace the first document in `montours` whose name is 'Exciting journey to Monash FIT' with a replacement document that contains only the field `name` set to `Monash FIT tour`, use the following:

```
result = montours.replace_one( \
    {"name":"Exciting journey to Monash FIT"}, \
    {"name":"Monash FIT tour"})
```


<font color='blue'>
**Exercise**: Check the result by printing the replaced document and then printing all documents in `montours`.
</font><br>


In [21]:
# Check the result (print the updated document)
updated_document = montours.find_one({"name": "Monash FIT tour"})
print("Updated Document:")
print(updated_document)

# Print all documents in the collection
all_documents = montours.find()
print("\nAll Documents:")
for doc in all_documents:
    print(doc)

Updated Document:
None

All Documents:
{'_id': ObjectId('66272ae1bb08c8426703fbf6'), 'package': 'MonITTour', 'name': 'Monash IT Faculty Tour', 'length': 1, 'price': 100, 'location': 'Caulfield', 'organiser': {'faculty': 'FIT', 'person': 'John Smith'}, 'tags': ['Monash', 'FIT', 'Caulfield']}
{'_id': ObjectId('66272c67bb08c8426703fbf7'), 'package': 'MonArtTour', 'name': 'Monash Art Tour', 'length': 2, 'price': 50, 'location': 'Caulfield', 'organiser': {'faculty': 'AIYOYO', 'person': 'Katherine McDonald'}, 'tags': ['Monash', 'Art', 'Caulfield']}
{'_id': ObjectId('66272c67bb08c8426703fbf8'), 'package': 'MonITTour', 'name': 'Monash IT Tour at Clayton', 'length': 3, 'price': 50, 'location': 'Clayton', 'organiser': {'faculty': 'FIT', 'person': 'Josh Gange'}, 'tags': ['Monash', 'FIT', 'Clayton']}

Note: in the captured output above, `find_one()` returned `None` because the `update_many()` and `replace_one()` snippets shown earlier were not executed in this session, so no document is named 'Monash FIT tour'. After running both snippets, the first matching document should contain only `_id` and `name`.


### Delete

Use `delete_one()` or `delete_many()` to delete documents from a collection. These methods take a condition to match the documents to be deleted. The syntax is simple. For example, let's delete one document that matches the following condition:

In [None]:
result = montours.delete_one({"package":"MonArtTour"})

The above command deletes the first document whose package is `MonArtTour`. We can check whether document(s) have been deleted or not using:
```
result.deleted_count
```

To delete all documents matching the condition, use the `delete_many()` method. Note that `delete_many()` requires a filter argument; passing an empty filter, `delete_many({})`, deletes all documents from the specified collection.

<font color='blue'>
**Exercise**: Delete all documents whose name is 'Monash FIT tour'. Check the result.
</font><br>
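
One possible approach to this exercise is sketched below. The helper name `delete_tours_by_name` is our own, and `collection` is assumed to be a PyMongo collection such as `montours`.

```python
def delete_tours_by_name(collection, name):
    """Delete every document whose `name` field equals `name`; return the count."""
    result = collection.delete_many({"name": name})
    return result.deleted_count
```

Calling `delete_tours_by_name(montours, "Monash FIT tour")` returns the number of deleted documents, which is `0` if no document has that name.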

#### Drop
To delete all documents from a collection, it may be more efficient to drop the entire collection, including the indexes. Use the `drop()` method to drop a collection, including any indexes. 
```
yourCollectionName.drop()
```

Also, we can change the name of a collection using `rename()`. For example, if we want to change the name of `montours` to `monashTours`:
```
montours.rename("monashTours")
```
Check the result on the `mongo` shell.

### Read

Basically, to retrieve documents in a collection, we can use `find()`. If there is no document in `montours`, insert the above three documents into the collection again. 

The use of `find()` is very similar to the usage in the mongo shell. For example, run the following to see how to use `find()` in Python:

In [None]:
allTours = montours.find()
for tour in allTours:
    pprint(tour)

Now let's find documents with a matching condition. Look at and run the code below to understand how we can specify such a condition, and print the result using `print()` and `pprint()`.

In [None]:
# Find the package having 'MonITTour'
results = montours.find({"package":"MonITTour"})
for doc in results:
    print(doc)

In [None]:
# We learned the usage of pprint.
results = montours.find({"package":"MonITTour"})
for doc in results:
    pprint(doc)

#### find_one
The `find_one()` method returns a single document matching a query. Here, we use it to get the first document whose package is `MonITTour`.

In [None]:
pprint(montours.find_one({"package":"MonITTour"}))

#### count
To know how many documents match a query, we can perform the `count_documents()` operation. 

In [None]:
montours.count_documents({}) # count of all of the documents in a collection:

<font color='blue'>
**Exercise**: Count the number of documents whose package is 'MonITTour'.
</font><br>

**Solution and Expected Output**: 
```
montours.count_documents({"package":"MonITTour"})
```

In [None]:
montours.count_documents({"package":"MonITTour"})

#### range queries
As we went through in our previous tutorials, MongoDB supports many different types of advanced queries. As an example, let's perform a query where we limit results to tours whose `price` is less than or equal to `50`, but also sort the results by `price`:

In [None]:
results = montours.find({"price":{"$lte":50}}).sort("price")
for result in results:
    pprint(result)

<font color='blue'>
**Exercise**: Write a query that finds tours whose `price` is greater than or equal to 50. Sort the results by the field `length`.
</font><br>

**Solution and Expected Output**: 
```
results = montours.find({"price":{"$gte":50}}).sort("length")
for result in results:
    pprint(result)
```

#### Multiple query conditions
We can also combine multiple query conditions via logical conjunction (AND) and logical disjunctions (OR).

Let's specify a logical conjunction by separating multiple query conditions with a comma in the conditions document. For example, if we want to find tours whose package is `MonITTour` and whose price is greater than or equal to `100`:

In [None]:
results = montours.find({"package":"MonITTour", "price":{"$gte":100}})
for result in results:
    pprint(result)

Now let's see how to specify a logical disjunction using the `$or` query operator. For example, if we want to find tours whose package is `MonArtTour` or whose price is greater than or equal to `100`:

In [None]:
results = montours.find({"$or":[{"package":"MonArtTour"}, {"price":{"$gte":100}}]})
for result in results:
    pprint(result)

#### Sort the query results
To specify an order for the result set, append the `sort()` method to the query. Pass `sort()` either a single field name or a list of `(field, direction)` tuples, where the direction is `pymongo.ASCENDING` for ascending or `pymongo.DESCENDING` for descending.

Let's make a query where we limit results to tours whose `price` is greater than `1000`. Sort the results by `price` in ascending order and then by `name` in descending order. None of the sample tours above costs more than 1000, so lower the threshold if you want to see some output.

In [None]:
# Each sort key is a (field, direction) tuple
results = montours.find({"price":{"$gt":1000}}). \
sort([("price", pymongo.ASCENDING), ("name", pymongo.DESCENDING)])
for result in results:
    pprint(result)
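
To see what ordering a compound sort produces without touching the database, the following pure-Python sketch mimics a two-key sort on a list of dictionaries. It only illustrates the ordering for simple values of the same type; `sort_like_mongo` is a made-up helper, and MongoDB's own comparison rules for missing fields and mixed types are not reproduced.

```python
ASCENDING, DESCENDING = 1, -1  # same integer values as pymongo.ASCENDING / pymongo.DESCENDING

def sort_like_mongo(docs, spec):
    """Order a list of dicts by a list of (field, direction) pairs."""
    ordered = list(docs)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a compound ordering.
    for field, direction in reversed(spec):
        ordered.sort(key=lambda d: d[field], reverse=(direction == DESCENDING))
    return ordered

tours = [
    {"name": "Monash IT Tour", "price": 100},
    {"name": "Monash Art Tour", "price": 50},
    {"name": "Monash IT Tour at Clayton", "price": 50},
]
for tour in sort_like_mongo(tours, [("price", ASCENDING), ("name", DESCENDING)]):
    print(tour["price"], tour["name"])
```

The two tours priced at 50 come first, ordered by name from Z to A, followed by the tour priced at 100.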

#### Indexing
Let's demonstrate how to create a unique index on a key that rejects a document whose value for that key already exists in the index.

For this exercise, we will create a single key ascending index on the key `organiser.person`.

In [None]:
result = montours.create_index([('organiser.person', pymongo.ASCENDING)], unique=True)

Notice that we now have two indexes in `montours`: (1) the index on `_id` that MongoDB creates automatically, and (2) the index on `organiser.person`.

<font color='blue'>
**Exercise**: Create a document whose 'organiser.person' is 'Linda Adams'. Then, insert it into the 'montours' collection. What happened? Can you identify why an error occurred?
</font><br>

Also, we can check what fields have indexes via the following:

In [None]:
sorted(list(montours.index_information()))

#### Aggregate
In the previous tutorials, we learned that MongoDB can perform aggregation operations, such as grouping by a specified key and evaluating the total, count or average for each distinct group. In Python, we can also use aggregation using `aggregate()`. 

Let's count how many tours are in each package (i.e. the `package` field): 

```
results = montours.aggregate([{"$group":{"_id":"$package", "count":{"$sum":1}}}])
for document in results:
    pprint(document)
```

<font color='blue'>
**Exercise**: Find the count of tours in each organiser faculty. 
</font><br>

**Solution and Expected Output**:
```
results = montours.aggregate([{"$group":{"_id":"$organiser.faculty", "count":{"$sum":1}}}])
for document in results:
    pprint(document)
```

<font color='blue'>
**Exercise**: What's the average price of the tours for each package? We will keep the count there as well.
</font><br>
**Solution and Expected Output**:
```
results = montours.aggregate([{"$group":{"_id":"$package", "avg":{"$avg":"$price"}, "count":{"$sum":1}}}])
for document in results:
    pprint(document)
```
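
To make the semantics of grouping concrete, here is what the exercise asks for (average price and count per package) written in plain Python over the three tours inserted earlier. This is only an illustration of what `$group` with `$avg` and `$sum` computes, not how MongoDB implements it; `average_price_by_package` is our own name.

```python
from collections import defaultdict

def average_price_by_package(tours):
    """Group tours by 'package' and return {package: {'avg': ..., 'count': ...}}."""
    prices = defaultdict(list)
    for tour in tours:
        prices[tour["package"]].append(tour["price"])
    return {
        package: {"avg": sum(values) / len(values), "count": len(values)}
        for package, values in prices.items()
    }

# The three tours inserted earlier, reduced to the two fields that matter here
tours = [
    {"package": "MonITTour", "price": 100},
    {"package": "MonArtTour", "price": 50},
    {"package": "MonITTour", "price": 50},
]
print(average_price_by_package(tours))
```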

## Joining Two Collections 
In our previous tutorial, we learned how to join two collections in `mongo`. PyMongo provides the same functionality through `aggregate()`. To demonstrate it, we use the same collections that we used in the previous week to explain the join operation.

Let's first print documents in the `users` collection and the `units` collection.

In [None]:
users = db.users
units = db.units
results = users.find()
for document in results:
    pprint(document)

results = units.find()
for document in results:
    pprint(document)

Now let's apply a join operation using `$lookup` within `aggregate()`. Run the following code:

In [None]:
results = users.aggregate([{
"$lookup":
    {
        "from": "units", # name of the collection to join with
        "localField": "sid", # field in the 'users' collection to match on
        "foreignField" : "sid", # field in the 'units' collection to match on
        "as": "completed_units" # where to store your output
    }
}])
for document in results:
    pprint(document)
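
Conceptually, `$lookup` performs a left outer join: every document from the input collection is kept and gains an array holding the matching documents from the other collection. The pure-Python sketch below illustrates this for simple scalar fields. The sample data is hypothetical (the real `users` and `units` documents may look different), and `lookup` is a made-up helper.

```python
def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Attach to each local document the list of foreign documents whose
    foreign_field equals the local document's local_field."""
    joined = []
    for local in local_docs:
        matches = [f for f in foreign_docs
                   if f.get(foreign_field) == local.get(local_field)]
        # Keep the local document even when there is no match (empty list)
        joined.append({**local, as_field: matches})
    return joined

# Hypothetical sample data; the real `users` and `units` documents may differ
users = [{"sid": 1, "name": "Ann"}, {"sid": 2, "name": "Bob"}]
units = [{"sid": 1, "unit": "FIT3182"}, {"sid": 1, "unit": "FIT9131"}]

for doc in lookup(users, units, "sid", "sid", "completed_units"):
    print(doc)
```

Here the first user receives both units, while the second user is kept with an empty `completed_units` list.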

<font color='blue'>
**Exercise**: Write Python code corresponding to the following code in the `mongo` shell:
```
db.getCollection('units').aggregate([
   {
      $unwind: "$sid"
   },
   {
    $lookup:
        {
            from: "users",
            localField: "sid",
            foreignField : "sid",
            as: "completed_units"
        }
    }
]).pretty()
```
</font><br>

**Solution and Expected Output**:
```
results = units.aggregate([
{"$unwind": "$sid"},
{
"$lookup":
    {
    "from": "users",
    "localField": "sid",
    "foreignField": "sid",
    "as": "completed_units"
    }
}])
for document in results:
    pprint(document)
```

## Text processing
Now, let's apply MongoDB CRUD operations to the contents of a text file using `PyMongo`. This activity will help you apply what you've learned about CRUD operations and give you a good opportunity to see how to build an application using PyMongo.

Let's get started!

The following code reminds you of how to get a MongoDB database and a collection in Python.

In [None]:
from pymongo import MongoClient
from pprint import pprint

# Create a MongoClient
client = MongoClient()

# We will use the database: fit3182_db
db = client.fit3182_db

# The collection name is units
units = db.units

# Finally, we'll finish off by adding the main function
if __name__ == "__main__":
   print("This is main function")

The output of the above code should be:
```
This is main function
```

Before we get to the database interaction, let's read the lines from the text file "unit_synopsis.txt". In this file, we see the combination of a unit code and its synopsis. Use the following code to read the data:
```
with open('unit_synopsis.txt') as file:
    synopsis_set = file.readlines()
    synopsis_set = [line.strip() for line in synopsis_set] 
```

In the above code, `strip()` is used to remove the leading and trailing white space in each line.

### Create
Let's create data in the new collection `synopsis`. We now define a function: `create_synopsis()` which extracts the synopsis for units. The unit code and its synopsis will be inserted into the `synopsis` collection.

Since we need to know the list of available unit codes, we assume that this list exists in an array, `unit_code_list`:
```
unit_code_list = ['FIT3182', 'FIT9131', 'FIT9132']
```

```
def create_synopsis(myCol):
    unit_code = ""

    for line in synopsis_set:
        if line in unit_code_list:
            unit_code = line
            continue
        elif unit_code == "":
            continue
        else:
            synopsis_record = {
                'unit_code':unit_code,
                'synopsis':line
            }
            # insert() was removed in PyMongo 4; insert_one() adds a single document
            myCol.insert_one(synopsis_record)
myCol = db.synopsis
create_synopsis(myCol)
```

Run the above code, and check whether the new collection `synopsis` contains records.
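
The parsing step can also be separated from the database call, which makes it easy to try out on a few lines before inserting anything. The sketch below is a variation on `create_synopsis()`, not the lab's official solution: `build_synopsis_records` and the sample lines are hypothetical, and unlike the original it deliberately skips blank lines.

```python
def build_synopsis_records(lines, unit_codes):
    """Turn lines of the form [code, text, text, code, text, ...] into records."""
    records = []
    unit_code = ""
    for line in lines:
        if line in unit_codes:
            unit_code = line          # a new unit starts here
        elif unit_code and line:      # skip text before the first code and blank lines
            records.append({"unit_code": unit_code, "synopsis": line})
    return records

# Hypothetical lines standing in for the contents of unit_synopsis.txt
sample_lines = ["FIT3182", "Big data management.", "", "FIT9131", "Intro to programming."]
records = build_synopsis_records(sample_lines, ["FIT3182", "FIT9131", "FIT9132"])
print(records)
```

The resulting list could then be stored in one call with `db.synopsis.insert_many(records)`, provided the list is not empty, since `insert_many()` rejects an empty list.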

### Update
Now let's focus on update. We will convert every lowercase keyword that appears in the `synopsis` field to uppercase. Assume that the keywords are provided via a variable `keyword_set`:
```
keyword_set = ['software', 'database', 'programming', 'development', 'reasoning']
```

For this, we will use the `$regex` operator to find the synopses that contain each keyword, replace the lowercase keyword with its uppercase form, and update the collection. We will use the following code:

```
def update_synopsis(myCol):
    for keyword in keyword_set:
        for line in myCol.find({'synopsis': {'$regex':keyword}}):
            new_synopsis = line['synopsis'].replace(keyword, keyword.upper())
            # _id identifies exactly one document, so update_one() is sufficient
            myCol.update_one( {'_id':line["_id"]}, {'$set': {"synopsis":new_synopsis}})
myCol = db.synopsis
update_synopsis(myCol)
```
Run the above code, and check with the mongo shell to see whether the update was successful.
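
Note that `str.replace()` also changes keywords that occur inside longer words, so 'database' within 'databases' would become 'DATABASEs'. If only whole words should be uppercased, one option is Python's `re` module with word boundaries. This is a sketch of an alternative technique, not the approach used in the lab code; `uppercase_keywords` is a hypothetical helper.

```python
import re

def uppercase_keywords(text, keywords):
    """Uppercase each keyword in `text`, matching whole words only."""
    for keyword in keywords:
        # \b anchors the match at word boundaries; re.escape guards special characters
        text = re.sub(r"\b" + re.escape(keyword) + r"\b", keyword.upper(), text)
    return text

print(uppercase_keywords("software for databases and database design", ["software", "database"]))
```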

### Delete
Now let's move on to the delete operation.

<font color='blue'>
**Exercise**: Delete all documents whose 'synopsis' field contains 'SOFTWARE' or 'DATA'. You need to define a function, 'delete_synopsis()'. Use 'regex' to define the search pattern. Check the result.
</font><br>


**Solution and Expected Output**:
```
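def delete_synopsis(myCol):
    pattern = "SOFTWARE|DATA"
    # delete_many() replaces remove(), which was removed in PyMongo 4
    return myCol.delete_many({'synopsis': {'$regex': pattern}})
result = delete_synopsis(db.synopsis)
print(result.deleted_count)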
```

**Congratulations on finishing this activity!**