In [12]:
from IPython.display import HTML
def css_styling():
    # Read the stylesheet and close the file handle once it has been read
    with open("styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)

# 1 Introduction to MongoDB
<small>This introduction is partially inspired by the notes on Alberto Negron's [blog](http://altons.github.io/python/2013/01/21/gentle-introduction-to-mongodb-using-pymongo/)</small>

MongoDB is a document-oriented database, part of the NoSQL family of database systems. It stores structured data as JSON-like documents; from a Pythonic point of view, this is like storing dictionaries. One of its main features is that it is schema-less, i.e. it supports dynamic schemas. A schema in a relational database informally refers to the structure of the data it stores, i.e. what kind of data, which tables, which relations, etc.

## Connecting to MongoDB in the cloud

+ First, create an account at [https://cloud.mongodb.com/](https://cloud.mongodb.com/)
+ Build a free cluster instance of 0.5 GB
+ Create a Database User under Database Access, and connect to it using the code below
+ Install pymongo + dnspython

`pip install "pymongo[srv]"`

In [None]:
# One way: read the full connection string from the first line of a local file
import pymongo

with open('my_credentials.txt', 'r') as f:
    connect_order = f.readlines()[0].rstrip()
conn = pymongo.MongoClient(connect_order)
db = conn['test']

print("Connect order", connect_order)
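
If the user name or password contains characters such as `@` or `/`, they need to be percent-encoded before being placed in the connection string. A minimal sketch, assuming the credentials are available as plain strings; the helper name `build_uri`, the credentials, and the host are hypothetical placeholders, not values from this notebook:

```python
from urllib.parse import quote_plus


def build_uri(user: str, password: str, host: str) -> str:
    """Build a mongodb+srv connection string with percent-encoded credentials."""
    return "mongodb+srv://{}:{}@{}/".format(quote_plus(user), quote_plus(password), host)


# Hypothetical credentials and host, for illustration only.
uri = build_uri("ana", "p@ss/word", "cluster0.example.mongodb.net")
print(uri)
```

The resulting string can be passed to `pymongo.MongoClient(uri)`. It is printed here only for illustration, since doing so exposes the password.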

In [46]:
# import pymongo
# try:
#     if 'conn' in globals():
#         conn.close()
#         print("Closing")
    
#     with open("credentials.txt", 'r') as f:
#         [name,password,url,dbname]=f.read().splitlines()
#         connect_order: str = "mongodb+srv://{}:{}@{}".format(name,password,url)
#     conn=pymongo.MongoClient(connect_order)
#     print ("Connected successfully!!!")
    
# except pymongo.errors.ConnectionFailure as e:
#     print ("Could not connect to MongoDB: %s" % e) 
# conn
# db = conn[dbname]
# db

# print(connect_order)

You can check your MongoDB database on this website:
[https://cloud.mongodb.com/](https://cloud.mongodb.com/)

### Connecting to a MongoDB database on localhost

First, let us set up the MongoDB system.

+ Download MongoDB.

https://www.mongodb.com/download-center/community

+ Create data directory:

`sudo mkdir -p /data/db`
+ Check that the server works

`sudo ./mongod --nojournal &`

+ Check the connection to the server: 

In another terminal, run `mongo` (in recent MongoDB releases the shell is called `mongosh`), check that it does not raise any error, and exit the console.
+ Stop the mongo daemon (mongod). You may have to kill mongod with

`killall mongod`

and remove the lock on the daemon:

`rm /data/db/mongod.lock`

+ Let us configure the database by setting the paths of the data storage and log files. Create a [mongo.conf](./mongo.conf) file such as the one provided and start the server using the following command:

`mongod --config=./mongo.conf --nojournal &`
        
+ Install pymongo 

`pip install pymongo`

In [47]:
# import pymongo

# # Connection to Mongo DB
# try:
#     conn=pymongo.MongoClient()
#     print ("Connected successfully!!!")
# except pymongo.errors.ConnectionFailure as e:
#     print ("Could not connect to MongoDB: %s" % e) 
# conn

## Accessing / Creating a database

We can **create** or **access** a database using attribute access <span style = "font-family:Courier;"> db = conn.name_db</span> or dictionary access <span style = "font-family:Courier;"> db = conn['name_db']</span>.

In [None]:
#Create a database using db = conn.name_db or dictionary access db = conn['name_db']
db = conn['ADS']
print (db)
conn.list_database_names()
#Empty databases do not show!

A database stores **collections**. A collection is a group of documents stored in MongoDB, and can be thought of as the equivalent of a table in a relational database. Getting a collection in PyMongo works the same as getting a database:

In [49]:
collection = db.edu
db.list_collection_names()
#Empty collections do not show!

['edu']

MongoDB stores structured data as JSON-like documents (encoded in a binary format called BSON), using dynamic schemas rather than predefined ones. An element of data is called a document, and documents are stored in collections. One collection may have any number of documents.

Compared to relational databases, we could say collections are like tables, and documents are like records. But there is one big difference: every record in a table has the same fields (with, usually, differing values) in the same order, while each document in a collection can have completely different fields from the other documents.

All you really need to know when you're using Python, however, is that documents are Python dictionaries that have strings as keys and can contain various primitive types (int, float, str, datetime) as well as other documents (Python dicts) and arrays (Python lists).
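
As a small illustration of this flexibility, the sketch below builds two documents with different fields, one of them containing a list and a nested document, and serializes them as JSON. The second document is made up for the example, and only the standard library is used, so no database is needed:

```python
import json

# Two documents that could live in the same collection even though
# their fields differ.
doc_a = {"GEO": "Spain", "TIME": 2002, "Value": 4.25}
doc_b = {
    "GEO": "Catalunya",
    "tags": ["region", "education"],               # array -> Python list
    "source": {"name": "example", "year": 2017},   # nested document -> dict
}

documents = [doc_a, doc_b]
for doc in documents:
    print(json.dumps(doc))

# The two documents have only one field name in common.
shared_fields = set(doc_a) & set(doc_b)
print(shared_fields)
```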

To insert some data into MongoDB, all we need to do is create a dict and call .insert_one() on the collection object (or .insert_many() for a list of dicts). Let us exemplify this process by loading a DataFrame and storing it in the collection.

In [50]:
import pandas as pd
df = pd.read_csv('./educ_figdp_1_Data.csv',na_values=':')
df.head(5)

| | TIME | GEO | INDIC_ED | Value | Flag and Footnotes |
|---|---|---|---|---|---|
| 0 | 2000 | European Union (28 countries) | Total public expenditure on education as % of ... | NaN | NaN |
| 1 | 2001 | European Union (28 countries) | Total public expenditure on education as % of ... | NaN | NaN |
| 2 | 2002 | European Union (28 countries) | Total public expenditure on education as % of ... | 5.0 | e |
| 3 | 2003 | European Union (28 countries) | Total public expenditure on education as % of ... | 5.03 | e |
| 4 | 2004 | European Union (28 countries) | Total public expenditure on education as % of ... | 4.95 | e |


In [51]:
df.to_dict("records")

[{'TIME': 2000,
  'GEO': 'European Union (28 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': nan,
  'Flag and Footnotes': nan},
 {'TIME': 2001,
  'GEO': 'European Union (28 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': nan,
  'Flag and Footnotes': nan},
 {'TIME': 2002,
  'GEO': 'European Union (28 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': 5.0,
  'Flag and Footnotes': 'e'},
 {'TIME': 2003,
  'GEO': 'European Union (28 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': 5.03,
  'Flag and Footnotes': 'e'},
 {'TIME': 2004,
  'GEO': 'European Union (28 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',


In [52]:
#insert documents in the collection
collection.insert_many(df.to_dict("records"));
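
The records above use `nan` for missing values, and they are stored as such. Since documents in a collection need not share the same fields, one alternative (not used in this notebook) is to omit the missing fields before inserting. A minimal sketch in plain Python; `drop_missing` is a hypothetical helper name:

```python
import math


def drop_missing(record: dict) -> dict:
    """Return a copy of `record` without the fields whose value is a float NaN."""
    return {
        key: value
        for key, value in record.items()
        if not (isinstance(value, float) and math.isnan(value))
    }


# Two records shaped like the output of df.to_dict("records").
records = [
    {"TIME": 2000, "GEO": "European Union (28 countries)",
     "Value": float("nan"), "Flag and Footnotes": float("nan")},
    {"TIME": 2002, "GEO": "European Union (28 countries)",
     "Value": 5.0, "Flag and Footnotes": "e"},
]

cleaned = [drop_missing(r) for r in records]
print(cleaned)
```

The cleaned list could then be passed to `collection.insert_many(cleaned)`. A query with `$exists` can later tell apart documents that have a `Value` field from those that do not.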

In [53]:
#Check that we have a non empty collection.
db.list_collection_names()

['edu']

To recap, we have databases containing collections. A collection is made up of documents. Each document is made up of fields.

### Retrieving data

In [54]:
collection.find_one() #Returns first document in the collection

{'_id': ObjectId('6336f2e48155fd577852080e'),
 'TIME': 2000,
 'GEO': 'European Union (28 countries)',
 'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
 'Value': nan,
 'Flag and Footnotes': nan}

To get more than a single document as the result of a query we use the find() method. find() returns a Cursor instance, which allows us to iterate over all matching documents.


In [55]:
collection.find()

<pymongo.cursor.Cursor at 0x7f9c5d41c5b0>

In [56]:
[d for d in collection.find()] 

[{'_id': ObjectId('6336f2e48155fd577852080e'),
  'TIME': 2000,
  'GEO': 'European Union (28 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': nan,
  'Flag and Footnotes': nan},
 {'_id': ObjectId('6336f2e48155fd577852080f'),
  'TIME': 2001,
  'GEO': 'European Union (28 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': nan,
  'Flag and Footnotes': nan},
 {'_id': ObjectId('6336f2e48155fd5778520810'),
  'TIME': 2002,
  'GEO': 'European Union (28 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': 5.0,
  'Flag and Footnotes': 'e'},
 {'_id': ObjectId('6336f2e48155fd5778520811'),
  'TIME': 2003,
  'GEO': 'European Union (28 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': 5.03,
  'Fl

If we just want to know how many documents match a query, we can call count_documents() instead of running a full query. With an empty filter, it returns the number of documents in the whole collection:

In [57]:
collection.count_documents({})

2688

### Basic queries

Queries in PyMongo are expressed as a filter dictionary passed to .find():

In [58]:
[d for d in collection.find({"TIME":2009})]

[{'_id': ObjectId('6336f2e48155fd5778520817'),
  'TIME': 2009,
  'GEO': 'European Union (28 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': 5.38,
  'Flag and Footnotes': 'e'},
 {'_id': ObjectId('6336f2e48155fd5778520823'),
  'TIME': 2009,
  'GEO': 'European Union (27 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': 5.38,
  'Flag and Footnotes': 'e'},
 {'_id': ObjectId('6336f2e48155fd577852082f'),
  'TIME': 2009,
  'GEO': 'European Union (25 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': 5.41,
  'Flag and Footnotes': 'e'},
 {'_id': ObjectId('6336f2e48155fd577852083b'),
  'TIME': 2009,
  'GEO': 'Euro area (18 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': 5.31,
  'Flag

Observe that it finds only exact matches: the value must have the same data type, and string comparison is case-sensitive.

In [41]:
[d for d in collection.find({"TIME":"2009"})]

[]

In [42]:
[d for d in collection.find({"GEO":"Spain"})]

[{'_id': ObjectId('6336f2e48155fd57785208c2'),
  'TIME': 2000,
  'GEO': 'Spain',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': 4.28,
  'Flag and Footnotes': nan},
 {'_id': ObjectId('6336f2e48155fd57785208c3'),
  'TIME': 2001,
  'GEO': 'Spain',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': 4.24,
  'Flag and Footnotes': nan},
 {'_id': ObjectId('6336f2e48155fd57785208c4'),
  'TIME': 2002,
  'GEO': 'Spain',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': 4.25,
  'Flag and Footnotes': nan},
 {'_id': ObjectId('6336f2e48155fd57785208c5'),
  'TIME': 2003,
  'GEO': 'Spain',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': 4.28,
  'Flag and Footnotes': nan},
 {'_id': ObjectId('6336f2e48155fd57785208c6'),
  'TIME': 2004,
  'GE

In [43]:
[d for d in collection.find({"GEO":"SPAIN"})]

[]
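
The empty result comes from string matching being case-sensitive. One way around it is a `$regex` filter with the `i` option. The sketch below only builds the filter dictionary and checks the pattern locally with Python's `re` module, so it runs without a database; `case_insensitive_filter` is a hypothetical helper name:

```python
import re


def case_insensitive_filter(field: str, text: str) -> dict:
    """Build a filter that matches `text` as a whole value, ignoring case."""
    # re.escape protects characters such as the parentheses in names like
    # "European Union (28 countries)".
    pattern = "^" + re.escape(text) + "$"
    return {field: {"$regex": pattern, "$options": "i"}}


query = case_insensitive_filter("GEO", "SPAIN")
print(query)

# Local sanity check of the pattern against the spelling stored in the data.
pattern = query["GEO"]["$regex"]
print(bool(re.search(pattern, "Spain", re.IGNORECASE)))
```

Passing `query` to `collection.find()` should then also match the documents stored as `'Spain'`.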

Query operators are written with a `$` prefix inside the filter. They include `$gt` (greater than), `$gte` (greater than or equal to), `$lt` (less than), `$lte` (less than or equal to), `$ne` (not equal), `$nin` (not in a list), `$regex` (regular expression), `$exists`, `$not`, `$or`, `$and`, etc. Let us see some examples:

In [59]:
[d for d in collection.find({"TIME":{"$gte":2009}})]

[{'_id': ObjectId('6336f2e48155fd5778520817'),
  'TIME': 2009,
  'GEO': 'European Union (28 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': 5.38,
  'Flag and Footnotes': 'e'},
 {'_id': ObjectId('6336f2e48155fd5778520818'),
  'TIME': 2010,
  'GEO': 'European Union (28 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': 5.41,
  'Flag and Footnotes': 'e'},
 {'_id': ObjectId('6336f2e48155fd5778520819'),
  'TIME': 2011,
  'GEO': 'European Union (28 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': 5.25,
  'Flag and Footnotes': 'e'},
 {'_id': ObjectId('6336f2e48155fd5778520823'),
  'TIME': 2009,
  'GEO': 'European Union (27 countries)',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'Value': 5.38,
  

In [60]:
substring = r'Euro'
reg = substring
[(i["GEO"]) for i in collection.find({"GEO":{"$regex":reg}})]

['European Union (28 countries)',
 'European Union (28 countries)',
 'European Union (28 countries)',
 'European Union (28 countries)',
 'European Union (28 countries)',
 'European Union (28 countries)',
 'European Union (28 countries)',
 'European Union (28 countries)',
 'European Union (28 countries)',
 'European Union (28 countries)',
 'European Union (28 countries)',
 'European Union (28 countries)',
 'European Union (27 countries)',
 'European Union (27 countries)',
 'European Union (27 countries)',
 'European Union (27 countries)',
 'European Union (27 countries)',
 'European Union (27 countries)',
 'European Union (27 countries)',
 'European Union (27 countries)',
 'European Union (27 countries)',
 'European Union (27 countries)',
 'European Union (27 countries)',
 'European Union (27 countries)',
 'European Union (25 countries)',
 'European Union (25 countries)',
 'European Union (25 countries)',
 'European Union (25 countries)',
 'European Union (25 countries)',
 'European Uni

In [61]:
for item in collection.find({"GEO":{"$regex":reg}}):
    print(item['GEO'])

European Union (28 countries)
European Union (28 countries)
European Union (28 countries)
European Union (28 countries)
European Union (28 countries)
European Union (28 countries)
European Union (28 countries)
European Union (28 countries)
European Union (28 countries)
European Union (28 countries)
European Union (28 countries)
European Union (28 countries)
European Union (27 countries)
European Union (27 countries)
European Union (27 countries)
European Union (27 countries)
European Union (27 countries)
European Union (27 countries)
European Union (27 countries)
European Union (27 countries)
European Union (27 countries)
European Union (27 countries)
European Union (27 countries)
European Union (27 countries)
European Union (25 countries)
European Union (25 countries)
European Union (25 countries)
European Union (25 countries)
European Union (25 countries)
European Union (25 countries)
European Union (25 countries)
European Union (25 countries)
European Union (25 countries)
European U
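
Several operators can also be combined in a single filter. The sketch below only constructs the filter dictionaries, so no database is needed to run it; the particular years and country names are chosen for illustration:

```python
# Documents with 2009 <= TIME <= 2011 whose GEO is one of the listed names.
# Conditions on different fields are combined with an implicit AND.
range_and_in = {
    "TIME": {"$gte": 2009, "$lte": 2011},
    "GEO": {"$in": ["Spain", "Andorra"]},
}

# Documents that either lack a "Value" field or refer to years before 2002.
either = {
    "$or": [
        {"Value": {"$exists": False}},
        {"TIME": {"$lt": 2002}},
    ]
}

for query in (range_and_in, either):
    print(query)
```

Either dictionary can be passed to `collection.find()` or `collection.count_documents()` in the same way as the filters used above.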

### Update

In this section, several methods for updating and deleting documents are reviewed:

+ Replace. This method finds the first document matching the query and **replaces** it with the new document.

In [55]:
#Insert One new Document
import numpy as np
doc = {'Flag and Footnotes': np.nan,
  'GEO': 'Catalunya',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'TIME': 2017,
  'Value': np.nan}
collection.insert_one(doc)

<pymongo.results.InsertOneResult at 0x11b6a5e00>

In [56]:
for doc in collection.find({'GEO':"Catalunya"}):
    print (doc)

{'_id': ObjectId('6154507e22ce1956b4fc36ad'), 'Flag and Footnotes': nan, 'GEO': 'Catalunya', 'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined', 'TIME': 2017, 'Value': nan}


In [57]:
#Replace first occurrence
newdoc = {'Flag and Footnotes': np.nan,
  'GEO': 'Catalunya',
  'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined',
  'TIME': 2017,
  'Value': 15}
collection.replace_one({'GEO':"Catalunya"},newdoc)

for doc in collection.find({'GEO':"Catalunya"}):
    print (doc)

{'_id': ObjectId('6154507e22ce1956b4fc36ad'), 'Flag and Footnotes': nan, 'GEO': 'Catalunya', 'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined', 'TIME': 2017, 'Value': 15}


If we do not want to rewrite the whole document and only want to change some of its fields, we can use **update_one** together with an update operator (sub-command). Let us look at some of them:

+ Sub-command **Set**:

The statement `update_one({field: value1}, {"$set": {field1: value2}})` updates the first document in the collection whose `field` equals `value1`, setting `field1` to `value2`. The `$set` operator adds the specified field or fields if they do not exist in the document, and replaces their values if they already exist.

An upsert eliminates the need to perform a separate database call to check for the existence of a record before performing either an update or an insert operation. Typically, update operations modify existing documents, but in MongoDB the update_one() operation also accepts an upsert option. An upsert is a hybrid operation that uses the query argument to determine the write operation:

+ If the query matches an existing document, the upsert performs an update.
+ If the query matches no document in the collection, the upsert inserts a single document.

In [59]:
#Update first occurrence
collection.update_one({'GEO':"Catalunya"},{"$set":{"Value":12}})

<pymongo.results.UpdateResult at 0x11b7f84c0>

In [60]:
for doc in collection.find({'GEO':"Catalunya"}):
    print (doc)

{'_id': ObjectId('6154507e22ce1956b4fc36ad'), 'Flag and Footnotes': nan, 'GEO': 'Catalunya', 'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined', 'TIME': 2017, 'Value': 12}


By default, if the filter does not match any document, nothing is inserted into the database. If you want the document to be inserted in that case, set the upsert flag to `True`.

In [61]:
collection.update_one({'GEO':"Andorra"},{"$set":{"Value":12}},upsert = True)
for doc in collection.find({'GEO':"Andorra"}):
    print (doc)

{'_id': ObjectId('6154512dd87047bf2b704a8d'), 'GEO': 'Andorra', 'Value': 12}
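
The decision an upsert makes can be mimicked in plain Python. The sketch below is a toy model over a list of dictionaries, not PyMongo code: `toy_upsert_set` is a hypothetical helper that supports only equality filters and a `$set`-style update.

```python
def toy_upsert_set(documents: list, query: dict, set_fields: dict) -> str:
    """Mimic update_one(query, {"$set": set_fields}, upsert=True) on a list of dicts."""
    for doc in documents:
        # A document matches when every field in the query has the same value.
        if all(doc.get(key) == value for key, value in query.items()):
            doc.update(set_fields)  # update only the first match
            return "updated"
    # No match: insert a new document built from the query and the set fields.
    documents.append({**query, **set_fields})
    return "inserted"


docs = [{"GEO": "Catalunya", "TIME": 2017, "Value": 15}]

print(toy_upsert_set(docs, {"GEO": "Catalunya"}, {"Value": 12}))  # a document matches
print(toy_upsert_set(docs, {"GEO": "Andorra"}, {"Value": 12}))    # no document matches
print(docs)
```

The real operation additionally assigns an `_id` to the inserted document, as the output above shows.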


+ Sub-command **Unset**:

The `$unset` operator deletes a particular field. If a document matches the query but does not have the field specified in the unset operation, the statement has no effect on it.

In [62]:
collection.update_one({'GEO':"Catalunya"},{"$unset":{"Flag and Footnotes":""}})

<pymongo.results.UpdateResult at 0x11b821800>

In [63]:
for doc in collection.find({'GEO':"Catalunya"}):
    print (doc)

{'_id': ObjectId('6154507e22ce1956b4fc36ad'), 'GEO': 'Catalunya', 'INDIC_ED': 'Total public expenditure on education as % of GDP, for all levels of education combined', 'TIME': 2017, 'Value': 12}
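
`$set` and `$unset` can appear in the same update document, as long as they do not target the same field. A small sketch that only builds such a dictionary; `make_update` is a hypothetical helper name:

```python
def make_update(set_fields: dict, unset_fields: list) -> dict:
    """Build an update document combining $set and $unset."""
    overlap = set(set_fields) & set(unset_fields)
    if overlap:
        raise ValueError("fields cannot be both set and unset: {}".format(sorted(overlap)))
    update = {}
    if set_fields:
        update["$set"] = dict(set_fields)
    if unset_fields:
        # The value given to $unset is ignored; an empty string is conventional.
        update["$unset"] = {name: "" for name in unset_fields}
    return update


update = make_update({"Value": 12}, ["Flag and Footnotes"])
print(update)
```

The result could be passed as the second argument of `collection.update_one({'GEO': "Catalunya"}, update)`.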


### Delete operations

We can remove a single document with `delete_one()`:

In [64]:
collection.delete_one({"GEO":"Andorra"})

<pymongo.results.DeleteResult at 0x11b821dc0>

In [65]:
for doc in collection.find({"GEO":"Andorra"}):
    print (doc)

A whole collection is removed with `drop_collection()`:

In [66]:
db.list_collection_names()

['edu']

In [67]:
db.drop_collection("edu")
db.list_collection_names()

[]

A database is removed with `drop_database()`:

In [68]:
conn.list_database_names()

['admin', 'local']

In [69]:
conn.drop_database('ADS')  # same name (and case) used when the database was created
conn.list_database_names()

['admin', 'local']

Finally, close the connection to the database.

In [70]:
conn.close()