![alt text](https://i.imgur.com/1WaY7aA.png)

# Lab 2.1.4 
# *Python with MongoDB*

## Introduction to PyMongo

In [8]:
from IPython.display import display, HTML
import pymongo
from pymongo import MongoClient
print ('Mongo version ' + pymongo.__version__)

Mongo version 3.8.0


Start the mongod server (if it isn't already running) by executing  
`mongod`  
at the command prompt. 

In [9]:
client = MongoClient('localhost', 27017)

In [11]:
db = client.test
collection = db.people
#collection.drop()

Create a new database:

In [12]:
mydb = client["mydatabase"]

Confirm that the database exists ... 

- list all databases in your system:

In [13]:
print(client.list_database_names())

ServerSelectionTimeoutError: localhost:27017: [WinError 10061] No connection could be made because the target machine actively refused it

- check for the database by name:

In [None]:
dblist = client.list_database_names()
if "mydatabase" in dblist:
    print("The database exists.")

The new database was not found because MongoDB is lazy: the database isn't created until data has been written to it!

Create a collection called "customers" (with object name `mycol`):

In [None]:
mycol = mydb["customers"]

Create a document (i.e. a dictionary) with two name:value items 
("name" = "John", and "address" = "Highway 37") and insert 
it into the "customers" collection: 

In [None]:
mydict = { "name": "John", "address": "Highway 37" }
x = mycol.insert_one(mydict)

Now test for the existence of the database:

In [None]:
print(client.list_database_names())

List all collections in the database:

In [None]:
print(mydb.list_collection_names())

Insert another record in the "customers" collection 
("name" = "Peter", "address" = "Lowstreet 27") 
and return the value of the _id field:

In [None]:
mydict = { "name": "Peter", "address": "Lowstreet 27" }
x = mycol.insert_one(mydict)
print(x.inserted_id)

Given the list of dicts below, insert multiple documents into 
the collection using the insert_many() method:

In [None]:
mylist = [
  { "name": "Amy", "address": "Apple st 652"},
  { "name": "Hannah", "address": "Mountain 21"},
  { "name": "Michael", "address": "Valley 345"},
  { "name": "Sandy", "address": "Ocean blvd 2"},
  { "name": "Betty", "address": "Green Grass 1"},
  { "name": "Richard", "address": "Sky st 331"},
  { "name": "Susan", "address": "One way 98"},
  { "name": "Vicky", "address": "Yellow Garden 2"},
  { "name": "Ben", "address": "Park Lane 38"},
  { "name": "William", "address": "Central st 954"},
  { "name": "Chuck", "address": "Main Road 989"},
  { "name": "Viola", "address": "Sideway 1633"}
]

In [None]:

x = mycol.insert_many(mylist)

Print a list of the _id values of the inserted documents:

In [None]:
print(x.inserted_ids)

Execute the next cell to insert a list of dicts with specified `_id`s:

In [None]:
mylist = [
  { "_id": 1, "name": "John", "address": "Highway 37"},
  { "_id": 2, "name": "Peter", "address": "Lowstreet 27"},
  { "_id": 3, "name": "Amy", "address": "Apple st 652"},
  { "_id": 4, "name": "Hannah", "address": "Mountain 21"},
  { "_id": 5, "name": "Michael", "address": "Valley 345"},
  { "_id": 6, "name": "Sandy", "address": "Ocean blvd 2"},
  { "_id": 7, "name": "Betty", "address": "Green Grass 1"},
  { "_id": 8, "name": "Richard", "address": "Sky st 331"},
  { "_id": 9, "name": "Susan", "address": "One way 98"},
  { "_id": 10, "name": "Vicky", "address": "Yellow Garden 2"},
  { "_id": 11, "name": "Ben", "address": "Park Lane 38"},
  { "_id": 12, "name": "William", "address": "Central st 954"},
  { "_id": 13, "name": "Chuck", "address": "Main Road 989"},
  { "_id": 14, "name": "Viola", "address": "Sideway 1633"}
]
x = mycol.insert_many(mylist)
print(x.inserted_ids)

Now try inserting a new dict with an existing `_id`:

In [None]:
x = mycol.insert_one({ "_id": 14, "name": "Manuel", "address": "Barcelona"})

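The insert fails with a `DuplicateKeyError`, because `_id` values must be unique within a collection. So, if we want to manage `_id`s in code, we need to be careful!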

This returns the first document in the collection:

In [None]:
x = mycol.find_one()
print(x)

Do the same for the document containing "name" = "Hannah":

In [None]:
x = mycol.find_one({"name": "Hannah"})
print(x)

This returns (and prints) all documents in the collection:

In [None]:
for x in mycol.find():
    print(x)

This returns only the name and address fields:

In [None]:
for x in mycol.find({},{ "_id": 0, "name": 1, "address": 1 }):
    print(x)

Print only the `_id` and name fields:

In [None]:
for x in mycol.find({},{ "_id": 1, "name": 1 }):
    print(x)

So, `_id` is returned by default and we must explicitly use `"_id": 0` to exclude it, but other fields are excluded simply by omitting them from the dict argument.
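The projection rule can be mimicked in plain Python. This is only a sketch of the behaviour described above, not how MongoDB implements projections; `project` is a hypothetical helper that handles flat documents and inclusion-style projections only.

```python
def project(doc, projection):
    """Mimic an inclusion-style projection on a flat document (illustration only)."""
    # Fields requested with a truthy flag (e.g. 1) are kept
    wanted = {field for field, flag in projection.items() if flag and field != "_id"}
    result = {}
    # `_id` is kept unless the projection explicitly sets it to 0
    if projection.get("_id", 1) and "_id" in doc:
        result["_id"] = doc["_id"]
    for field in doc:
        if field in wanted:
            result[field] = doc[field]
    return result


doc = {"_id": 3, "name": "Amy", "address": "Apple st 652"}
print(project(doc, {"_id": 0, "name": 1, "address": 1}))  # no _id in the result
print(project(doc, {"name": 1}))                          # _id comes along by default
```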

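To include field conditionals in a query, we use `$` operators. This finds addresses that sort alphabetically after "S", i.e. those that start with "S" followed by at least one more character, or with any later letter: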

In [None]:
myquery = { "address": { "$gt": "S" } }
mydoc = mycol.find(myquery)
for x in mydoc:
    print(x)
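To see which addresses such a query would pick up, the same comparison can be tried in plain Python. This sketch assumes the collection uses MongoDB's default string ordering (no collation), which, like Python, compares strings character by character and is case-sensitive; the list below is a hand-picked subset of the addresses inserted earlier.

```python
addresses = ["Apple st 652", "Park Lane 38", "Sideway 1633", "Sky st 331", "Valley 345"]

# A string that starts with "S" and has more characters is greater than "S" itself,
# and so is any string starting with a later letter such as "V".
matches = [address for address in addresses if address > "S"]
print(matches)
```

Note that the comparison is case-sensitive: a lower-case "sky st 331" would also compare as greater than "S", because lower-case letters sort after upper-case ones.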

Here are some more query operators (comparison, element and logical):

            $gt, $gte, $eq, $in, $nin, $exists, $and, $or, $not
            
Experiment with these until you understand how to use them.
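As a starting point for experimenting, here are a few filter dicts built from these operators. They are illustrative only: the helper name `example_queries` and the particular values are assumptions, and each dict is meant to be passed to `find()`.

```python
def example_queries():
    """Return a few filter dicts that could be passed to find() (illustrative only)."""
    return {
        # name is one of the listed values
        "in": {"name": {"$in": ["Amy", "Ben"]}},
        # name is none of the listed values
        "nin": {"name": {"$nin": ["Amy", "Ben"]}},
        # the document has an address field at all
        "exists": {"address": {"$exists": True}},
        # either condition may hold
        "or": {"$or": [{"name": "Amy"}, {"address": {"$gte": "S"}}]},
        # both conditions must hold; $not negates an operator expression
        "and": {"$and": [{"name": {"$gte": "A"}}, {"name": {"$not": {"$gte": "C"}}}]},
    }


for label, query in example_queries().items():
    print(label, query)
```

With a running server, a query from the dict could be tried with, for example, `mycol.find(example_queries()["or"])`.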

Now find all docs with an address that begins with "S":  
(HINT: The value for "address" in the argument should be the regex-based dict { "$regex": "^S" }.)

In [None]:
myquery = { "address": { "$regex": "^S" } }
mydoc = mycol.find(myquery)
for x in mydoc:
    print(x)
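The effect of the `^` anchor can be checked offline with Python's own `re` module. This is a sketch only: MongoDB uses its own regex engine, but for a simple anchored pattern like this the behaviour is the same. The sample list is a hand-picked subset of addresses with one variation added ("Apple St 652" with a capital "S") to show that a match is only found at the start.

```python
import re

addresses = ["Sideway 1633", "Sky st 331", "Valley 345", "Main Road 989", "Apple St 652"]

# "^S" matches only when the string begins with an upper-case S
pattern = re.compile("^S")
starts_with_s = [address for address in addresses if pattern.search(address)]
print(starts_with_s)
```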

Sorting can be applied by calling the `sort()` method on the cursor returned by `find()`. Sort the collection by the name field:

In [None]:
mydoc = mycol.find().sort("name")
for x in mydoc:
    print(x)

Now sort in reverse order. (HINT: The `sort()` method takes an optional second parameter, the direction.)

In [None]:
mydoc = mycol.find().sort("name", direction=pymongo.DESCENDING)
for x in mydoc:
    print(x) 

A single record can be deleted by specifying some criterion:

In [None]:
mycol.delete_one({ "address": "Mountain 21" })

Now delete all docs whose `_id` is a number less than 15 (i.e. the documents we inserted with explicit integer `_id`s):

In [None]:
mycol.delete_many({ "_id": {"$lt": 15} })
for x in mycol.find():
    print(x)

This would delete all docs:
`x = mycol.delete_many({})`

This would remove the collection:
`mycol.drop()`

Change the first instance of "address" == "Valley 345" to "Canyon 123" using update_one().  
(HINT: The 1st parameter of update_one() is the criterion (query); the 2nd is a dict specifying the field to change and its new value.) 

In [None]:
myquery = { "address": "Valley 345" }
newvalues = { "$set": { "address": "Canyon 123" } }
mycol.update_one(myquery, newvalues)
for x in mycol.find():
    print(x)

The limit() method can be applied after the find() method to limit the number of docs returned. Show the first 5 docs:

In [None]:
myresult = mycol.find().limit(5)
for x in myresult:
    print(x)

## PyMongo for Data Science

MongoDB has many more features of interest to developers, but the main focus of a data scientist will be wrangling and munging the data. It may or may not be desirable to do all the data munging in Pandas; for a large, distributed database, it may be imperative to perform aggregation in MongoDB. 

In [14]:
# Ref:  https://rsandstroem.github.io/MongoDBDemo.html

import os
import pandas as pd
import numpy as np

This code creates a database named "test" and populates it from a JSON file using the mongoimport program (executed by the operating system itself, rather than in Python):

In [15]:
db = client.test
# Drop the collection in case it was previously created:
# collection = db.people
# collection.drop()

In [20]:
os.listdir('../DATA')

['dummyData.json',
 'eshop.db.sqlite',
 'houses.csv',
 'housing-data.csv',
 'names',
 'P12-ListOfOrders.csv',
 'P12-OrderBreakdown.csv']

In [22]:
#os.chdir("data")

In [23]:
pwd()

'C:\\Users\\Beau\\Documents\\DataScience\\Data-Science-Course\\Module 2\\LABS'

In [24]:
# A non-zero return value means the import failed (e.g. wrong path or mongod not running);
# in that case, execute mongoimport from a command window instead:
os.system('mongoimport -d test -c people ../DATA/dummyData.json')

1
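A return value of 1 says only that the import failed, not why. One way to get more information is to call mongoimport through `subprocess` and capture its output. The sketch below is an assumption-laden alternative, not part of the original lab: the helper names are hypothetical, and the path `../DATA/dummyData.json` is inferred from the directory listing above.

```python
import subprocess


def build_mongoimport_cmd(db_name, collection_name, json_path):
    """Build the argument list for mongoimport (hypothetical helper)."""
    return [
        "mongoimport",
        "--db", db_name,
        "--collection", collection_name,
        "--file", json_path,
    ]


def run_mongoimport(db_name, collection_name, json_path):
    """Run mongoimport and return the CompletedProcess with captured output."""
    cmd = build_mongoimport_cmd(db_name, collection_name, json_path)
    # Raises FileNotFoundError if mongoimport is not on the PATH
    return subprocess.run(cmd, capture_output=True, text=True)


print(build_mongoimport_cmd("test", "people", "../DATA/dummyData.json"))

# To actually run the import (needs mongoimport installed and mongod running):
# result = run_mongoimport("test", "people", "../DATA/dummyData.json")
# print(result.returncode, result.stdout, result.stderr)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.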

In [25]:
#os.chdir("..")

In [26]:
pwd()

'C:\\Users\\Beau\\Documents\\DataScience\\Data-Science-Course\\Module 2\\LABS'

In [28]:
db = client.test
collection = db.people
cursor = collection.find().sort('Age',pymongo.ASCENDING).limit(3)
for doc in cursor:
    print(doc)

Here is a small demonstration of the MongoDB aggregation framework. We want to create a table of the number of persons in each country and their average age. To do it we group by country. We extract the results from MongoDB aggregation into a pandas dataframe, and use the country as index.

In [None]:
pipeline = [
        {"$group": {"_id":"$Country",
             "AvgAge":{"$avg":"$Age"},
             "Count":{"$sum":1},
        }},
        {"$sort":{"Count":-1,"AvgAge":1}}
]
aggResult = collection.aggregate(pipeline) # returns a cursor

df1 = pd.DataFrame(list(aggResult)) # use list to turn the cursor to an array of documents
df1 = df1.set_index("_id")
df1.head()
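To make clear what the `$group` and `$sort` stages compute, here is the same logic written in plain Python. It runs without a database; the three sample documents are hypothetical and only reuse the field names from the pipeline.

```python
from collections import defaultdict

# Hypothetical sample documents with the field names used in the pipeline
people = [
    {"Name": "A", "Country": "China", "Age": 30},
    {"Name": "B", "Country": "China", "Age": 40},
    {"Name": "C", "Country": "France", "Age": 25},
]

# Collect the ages for each country
ages_by_country = defaultdict(list)
for person in people:
    ages_by_country[person["Country"]].append(person["Age"])

# Equivalent of the $group stage: one row per country with average age and count
rows = [
    {"_id": country, "AvgAge": sum(ages) / len(ages), "Count": len(ages)}
    for country, ages in ages_by_country.items()
]

# Equivalent of the $sort stage: Count descending, then AvgAge ascending
rows.sort(key=lambda row: (-row["Count"], row["AvgAge"]))
print(rows)
```

For a large collection, the point of the aggregation framework is that this work happens on the server, so only the small summary table has to travel to Python.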

For simple cases one can either use a cursor through `find()` with a filter dict or use the "$match" operator in the aggregation framework, like this:

In [None]:
pipeline = [
        {"$match": {"Country":"China"}},
]
aggResult = collection.aggregate(pipeline)
df2 = pd.DataFrame(list(aggResult))
df2.head()

Now we can apply all the power of Python libraries to analyse and visualise the data. Here, we will use the folium package to plot markers for the locations of the people we just found in China (click on a marker to see their data):

In [None]:
# Un-comment and execute to install folium pkg (1st time only):
# import sys
# !{sys.executable} -m pip install folium

In [None]:
import folium
print ('Folium version ' + folium.__version__)

world_map = folium.Map(location = [35, 100], zoom_start = 4)
for i in range(len(df2)):
    location = [float(loc) for loc in df2.Location[i].split(',')]
    folium.Marker(location = location, popup = df2.Name[i] + ', age:' + str(df2.Age[i])).add_to(world_map)
    
world_map
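The loop above assumes that every `Location` value is a "latitude,longitude" string. A small helper can make that assumption explicit and fail with a clear message on malformed values. `parse_location` is a hypothetical name, and the coordinates in the example are made up for illustration.

```python
def parse_location(text):
    """Convert a 'lat,lon' string into a [lat, lon] list of floats."""
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 2:
        raise ValueError(f"expected 'lat,lon', got {text!r}")
    return [float(parts[0]), float(parts[1])]


print(parse_location("39.9, 116.4"))
```

Inside the loop, `location = parse_location(df2.Location[i])` could then replace the list comprehension.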

## HOMEWORK:


1. Read up on how to perform aggregation in MongoDB. Insert a duplicate record into the collection:

        mydict = { "name": "John", "address": "Highway 37" }

   Now write a command to find docs with a duplicate "name" field (using aggregation) and remove them.  
   Print the collection.

2. Read up on how to apply indexes in MongoDB. Create an index on the "name" and "address" fields in this collection.
   Print the indexes for the collection.







**© 2019 Data Science Institute of Australia**