Ashwin Lakshman 001353233

# *HANDS-ON GUIDE TO MONGODB*
## *INFO6210* DATA MANAGEMENT AND DATABASE DESIGN FINAL PROJECT
## BY *ASHWIN LAKSHMAN* ( *001353233* )
### NORTHEASTERN UNIVERSITY SPRING 2020


#### Topics Covered:
##### Introduction to NoSQL Databases
##### Introduction to MongoDB
##### Installation Guide
##### Mongo Shell
##### CRUD Operations
##### Aggregate Pipeline
##### MapReduce
##### 10 Practice Exercises
##### 1 Practice Project

# *1. INTRODUCTION*
##  1.1. What is a NoSQL Database?
“NoSQL Database” is an umbrella-term for any non-relational database which stores data in a format other than relational tables. NoSQL databases can store relationship data—they just store it differently than relational databases do. In fact, when compared with SQL databases, many find modeling relationship data in NoSQL databases to be easier than in SQL databases, because related data doesn’t have to be split between tables. NoSQL data models allow related data to be nested within a single data structure.
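As a rough illustration of the nesting described above, the sketch below models one customer and their orders first as two relational-style tables linked by a key, then as a single nested document. All names (`customers`, `orders`, `customer_doc`) are hypothetical and chosen only for this example.

```python
# Relational style: related data is split across two "tables" linked by customer_id.
customers = [{"customer_id": 1, "name": "Ada"}]
orders = [
    {"order_id": 10, "customer_id": 1, "item": "keyboard"},
    {"order_id": 11, "customer_id": 1, "item": "monitor"},
]

# Document style: the same information nested inside one structure.
customer_doc = {
    "name": "Ada",
    "orders": [
        {"order_id": 10, "item": "keyboard"},
        {"order_id": 11, "item": "monitor"},
    ],
}

# Relational read: filter the orders table by key (a join in SQL terms).
joined_items = [o["item"] for o in orders if o["customer_id"] == 1]

# Document read: the related data is already attached to the parent.
nested_items = [o["item"] for o in customer_doc["orders"]]

print(joined_items == nested_items)
```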
## 1.2 Types of NoSQL Databases:
There are currently 4 different types of databases that fall under the NoSQL umbrella:
1.	**Document Databases:** Document databases store data in documents similar to JSON (JavaScript Object Notation) objects. Each document contains pairs of fields and values. The values can typically be a variety of types including things like strings, numbers, boolean, arrays, or objects, and their structures typically align with objects developers are working with in code. Because of their variety of field value types and powerful query languages, document databases are great for a wide variety of use cases and can be used as a general purpose database. Examples: MarkLogic, InterSystems Caché, MongoDB, OrientDB

2.	**Key-value databases:** Key-value databases are a simpler type of database where each item contains keys and values. A value can only be retrieved by referencing its key. Key-value databases are great for use cases where you need to store large amounts of data but you don’t need to perform complex queries to retrieve it. Examples: Redis, Riak, and Oracle NoSQL Database


3.	**Column Datastores:** Column datastores store data in tables, rows, and dynamic columns. These databases are great for when you need to store large amounts of data and you can predict what your query patterns will be. Examples: Cassandra, Bigtable, HBase and Vertica.
4.	**Graph Databases:** Graph databases store data in nodes and edges. Nodes store information about people, places, and things while edges store information about the relationships between the nodes. Graph databases work best where there is a need to traverse relationships to look for patterns. There is a wide variety of applications for these databases, such as social networks, fraud detection and recommendation engines. Examples: JanusGraph, MarkLogic, Neo4j

## 1.3 MongoDB
MongoDB is an open source document-oriented database system developed and supported by MongoDB Inc. (formerly 10gen). It is part of the NoSQL family of database systems. Instead of storing data in tables as is done in a relational database, MongoDB stores structured data as JSON-like documents with dynamic schemas, serialized in a binary format called BSON (Binary JSON), making the integration of data in certain types of applications easier and faster.
## 1.4 Installation
Download & Installation Documentation Link: https://docs.mongodb.com/manual/installation/
MongoDB provides native builds for Linux, macOS, and Windows, so it can be installed directly on any of these systems by following the documentation linked above. Windows users who prefer to work in a Linux environment have additional options:
1.	Running Linux on a virtual machine
2.	Running a Linux terminal app from the Windows Store
For users running macOS, the installation is straightforward since macOS is a Unix-based system.
Whichever platform hosts the server, this guide uses the PyMongo library to work with MongoDB from Python. 

**PyMongo Installation:** Enter this into the terminal: *$ python -m pip install pymongo* 

**Install Jupyter:** If not yet installed, install jupyter using the following Terminal command: $ pip install jupyter

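To start the MongoDB server, open the terminal, navigate to the directory where MongoDB is installed and run *mongod* (for example, ./bin/mongod). Then, in a second terminal window, start the Mongo shell by running *mongo* as follows: ashwin@ubuntu:~/projects/mongodb$ ./bin/mongo

MongoDB is now running on your system. To exit the shell and return to the terminal, type *quit()*. To terminate the server, press Ctrl+C in the terminal where *mongod* is running.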

## 1.5 Pymongo
PyMongo is a Python distribution containing tools for working with MongoDB, and is the recommended way to work with MongoDB from Python. 
Once PyMongo has been installed, we can write a test application to return information about the Mongo server. Using a Python IDE such as PyCharm (recommended) or a text editor, type the following code:

In [1]:
import pymongo
from pymongo import MongoClient
# pprint library is used to make the output easier to read(pretty)
from pprint import pprint
# connect to MongoDB
try:
    conn = MongoClient()
    db = conn.admin
    # Issue the serverStatus command and print the results
    serverStatusResult = db.command("serverStatus")
    print("connected successfully!")
    pprint(serverStatusResult)
except pymongo.errors.PyMongoError:
    # Raised, for example, when no server is listening on localhost:27017
    print("Could not connect to MongoDB")
conn


connected successfully!
 'connections': {'active': 1,
                 'available': 999998,
                 'current': 2,
                 'totalCreated': 47},
 'extra_info': {'availPageFileMB': 5107,
                'note': 'fields vary by platform',
                'page_faults': 479986,
                'ramMB': 16226,
                'totalPageFileMB': 23394,
                'usagePageFileMB': 190},
 'flowControl': {'enabled': True,
                 'isLagged': False,
                 'isLaggedCount': 0,
                 'isLaggedTimeMicros': 0,
                 'locksPerOp': 0.0,
                 'sustainerRate': 0,
                 'targetRateLimit': 1000000000,
                 'timeAcquiringMicros': 0},
 'freeMonitoring': {'state': 'undecided'},
 'globalLock': {'activeClients': {'readers': 0, 'total': 0, 'writers': 0},
                'currentQueue': {'readers': 0, 'total': 0, 'writers': 0},
                'totalTime': 764239077000},
 'host': 'LAPTOP-J0E3R9P8',
 'localTime': d

MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)

# 2. The Database, Collections and Documents:
MongoDB stores JSON-like data in a binary format called BSON (Binary JSON); how a document is represented in code varies by programming language.
An example of a JSON document is as follows:



From the above example, it is evident that documents are not just key/value pairs but can include arrays and subdocuments.
MongoDB also supports data types such as geospatial data, decimal, and ISODate.
Internally, MongoDB stores a binary representation of JSON known as BSON, which allows it to provide data types like decimal that are not defined in the JSON specification.
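A small sketch of a document with an array and a subdocument, written as a Python dictionary, may help here. The field names are invented for illustration. The sketch also hints at why a binary format with extra types is useful: the standard `json` module serializes the basic document but rejects a `datetime` value, a type that BSON can represent.

```python
import json
from datetime import datetime

# A hypothetical document with an array ("tags") and a subdocument ("address").
person = {
    "name": "Ashwin",
    "tags": ["student", "mongodb"],
    "address": {"city": "Boston", "zip": "02115"},
}

# Plain JSON handles strings, numbers, arrays, and nested objects.
as_json = json.dumps(person)
round_trip = json.loads(as_json)

# JSON has no date type, so json.dumps refuses a datetime value.
person_with_date = dict(person, enrolled=datetime(2020, 1, 6))
try:
    json.dumps(person_with_date)
    date_supported = True
except TypeError:
    date_supported = False

print(round_trip["address"]["city"], date_supported)
```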
   
The SQL/RDBMS equivalent of a MongoDB Document is a row/record.

The equivalent of an SQL table is the **MongoDB Collection**
A collection in MongoDB is a container for documents. A database is the container for collections.
Compared to relational databases, collections are similar to tables, and documents similar to records, but with one big difference: every record in a table has the same fields in the same order, whereas each document in a collection can have completely different fields from the other documents.
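To make the difference concrete, the sketch below keeps two unrelated documents in one list that stands in for a collection and gathers the union of their field names. The list and its contents are hypothetical; a real collection would be accessed through PyMongo.

```python
# A plain list stands in for a collection in this sketch.
collection_like = [
    {"name": "Steve", "age": 25},
    {"text": "Hello #BigData", "retweet_count": 3, "geo": None},
]

# Each document carries its own set of fields; there is no shared column list.
fields_per_doc = [sorted(doc.keys()) for doc in collection_like]
all_fields = sorted(set().union(*(doc.keys() for doc in collection_like)))

print(fields_per_doc)
print(all_fields)
```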

Storing data in documents has many advantages, such as a dynamic, flexible schema and the ability to store arrays and nested documents directly from simple Python scripts.


**Creating a MongoDB database:** MongoDB creates databases and collections automatically if they don't already exist.
A single instance of MongoDB can support multiple independent databases. When working with PyMongo, you access databases using attribute-style access:

In [2]:
db = conn.mydb
db




Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'mydb')

To view the names of the available databases, type:

In [3]:
conn.list_database_names()

['admin',
 'assignment1db',
 'assignment2db',
 'assignment3',
 'config',
 'gamedb',
 'local',
 'moviedb',
 'mydb',
 'nyse',
 'testdb',
 'twitter_search']

However, databases that contain no collections are not displayed, because MongoDB only creates a database once data has been written to it.

**Creating a MongoDB Collection:** A collection is a group of documents stored in MongoDB, and is basically the equivalent of a table in a relational database. 

Creating a collection in PyMongo works the same as creating a database:



In [4]:
collection = db.my_collection
collection




Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'mydb'), 'my_collection')

**Creating a MongoDB Document:**
MongoDB stores structured data as JSON-like documents with dynamic schemas (serialized as BSON), instead of predefined schemas. An element of data is called a document, and documents are stored in collections. One collection may have any number of documents.
When using PyMongo, documents are Python dictionaries that have strings as keys and can contain various primitive types (int, float, str, datetime) as well as other documents (Python dictionaries) and arrays (Python lists).
To insert data into MongoDB, we create a dictionary and call insert_one() on the collection object (the older insert() method is deprecated):


In [5]:
doc = {"fname":"Ashwin","lname":"Lakshman","insta":"@ashwinlakshman"}

Next, the document needs to be inserted into the collection:



In [6]:
# insert_one() returns an InsertOneResult; inserted_id holds the _id of the new document
collection.insert_one(doc).inserted_id


ObjectId('5ea3ac4f0fee33e9a1859bd1')

We have now successfully created a document inside a MongoDB collection!


# 3. CRUD Operations

Before we begin with the basic CRUD operations, let’s use real-world data to make this tutorial easier to follow. We will use the Tweepy library to collect data from Twitter. Now let’s pull ten recent tweets related to Big Data.

In [7]:
import tweepy
from tweepy import API
lookup ='BigData'
# Replace the placeholders with your own Twitter API credentials; never publish real keys
auth = tweepy.OAuthHandler('YOUR_CONSUMER_KEY', 'YOUR_CONSUMER_SECRET')
auth.set_access_token('YOUR_ACCESS_TOKEN', 'YOUR_ACCESS_TOKEN_SECRET')

api = API(auth)

search = []

# In Tweepy 4.x this method is named search_tweets()
tweets = api.search(lookup,count=10)
for tweet in tweets:
    search.append(tweet)
search


[Status(_api=<tweepy.api.API object at 0x000002DF37904F70>, _json={'created_at': 'Sat Apr 25 03:19:32 +0000 2020', 'id': 1253886332499345420, 'id_str': '1253886332499345420', 'text': 'RT @IAM__Network: Early digital transformation efforts pay dividends as business continuity plans hold strong – Which-50 \n\nREAD MORE: https…', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'IAM__Network', 'name': 'IAM Platform', 'id': 226310002, 'id_str': '226310002', 'indices': [3, 16]}], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="https://google.com" rel="nofollow">digital transformation</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 1064218391920214016, 'id_str': '1064218391920214016', 'name': 'digital transformation', 'screen_name': 'digitaltransf11', 'location'

Now that we have the data as a Python list of Tweepy Status objects, we can check its length:


In [8]:
len(search)

10


To view the attributes available on each Status object:


In [9]:
dir(search[0])


['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_api',
 '_json',
 'author',
 'contributors',
 'coordinates',
 'created_at',
 'destroy',
 'entities',
 'favorite',
 'favorite_count',
 'favorited',
 'geo',
 'id',
 'id_str',
 'in_reply_to_screen_name',
 'in_reply_to_status_id',
 'in_reply_to_status_id_str',
 'in_reply_to_user_id',
 'in_reply_to_user_id_str',
 'is_quote_status',
 'lang',
 'metadata',
 'parse',
 'parse_list',
 'place',
 'retweet',
 'retweet_count',
 'retweeted',
 'retweeted_status',
 'retweets',
 'source',
 'source_url',
 'text',
 'truncated',
 'user']

In order to simplify the data we have collected, we remove the fields which are not necessary for this exercise.

But first, let's create a new database and collection.

**EXERCISE 1:** Create a database "twitter_search", using MongoClient instance "conn". And then create a Collection "posts"

In [10]:
# Enter code here and Run
# SAMPLE CODE ( TO BE REMOVED)
db = conn.twitter_search
posts = db.posts

Next, we loop through each object in the "search" list and insert it into MongoDB:


In [11]:
for tweet in search:
    # Empty dictionary for storing tweet related data
    data ={}
    data['created_at'] = tweet.created_at
    data['geo'] = tweet.geo
    data['id'] = tweet.id
    data['retweet_count'] = tweet.retweet_count
    data['source'] = tweet.source
    data['text'] = tweet.text
    data['in_reply_to_screen_name'] = tweet.in_reply_to_screen_name
    
    # Inserting Document into MongoDB Collection
    posts.insert_one(data)
data


{'created_at': datetime.datetime(2020, 4, 25, 3, 17, 6),
 'geo': None,
 'id': 1253885720005148673,
 'retweet_count': 47,
 'source': 'Twitter for iPhone',
 'text': 'RT @Fisher85M: What’s your #Security Maturity level? [Infographic]\n\n#CyberSecurity #infosec #education @Fisher85M #BigData #Analytics #Mach…',
 'in_reply_to_screen_name': None,
 '_id': ObjectId('5ea3ac500fee33e9a1859bdb')}

Congratulations! The document has been inserted into the MongoDB collection!

**EXERCISE 2**
Insert a manually entered and unrelated document into the same collection:
Name = "Ashwin"
Age = "7"
Gender = "Male"
With the address as a sub-document:
address: street="pennsylvania avenue"
         number = "1600"
         city = "Washington DC"
         

In [12]:
# Enter code here
# SAMPLE CODE TO BE REMOVED
Name = "Ashwin"
Age = "7"
Gender ="Male"
street="pennsylvania avenue"
number = "1600"
city = "Washington DC"
data = {  'name' : Name ,                                    # String 
          'age' : Age,                                       # String (stored as "7", not the integer 7)
          'gender' : Gender,                                 # String 
          'address': {
              'street' : street,                             # String
              'number' : number,                             # String
              'city' : city                                  # String 
              }
       }

insert_result = posts.insert_one( data)

Although this document is completely unrelated to the twitter document, MongoDB still allows them to be inserted into the database.



In [13]:
# To confirm that the insert was successful, use the following command:

insert_result.acknowledged    



True

In [14]:
# To view the id of the current document:

insert_result.inserted_id


ObjectId('5ea3ac500fee33e9a1859bdc')

**The _id field:**

MongoDB documents stored in a collection require a unique _id field that acts as a primary key. Because ObjectIds are small, unique, and fast to generate, MongoDB uses an ObjectId as the default value for the _id field if the _id field is not specified,
i.e., MongoDB adds the _id field and generates a unique ObjectId to assign as its value.
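An ObjectId is a 12-byte value whose first 4 bytes encode its creation time as seconds since the Unix epoch. The sketch below decodes that timestamp from the hexadecimal string of the ObjectId returned by the first insert in this guide, using only the standard library. `objectid_timestamp` is a helper written for this example and is not part of PyMongo, which exposes the same information as `ObjectId.generation_time`.

```python
from datetime import datetime, timezone

def objectid_timestamp(hex_id: str) -> datetime:
    """Return the creation time embedded in a 24-character ObjectId hex string."""
    if len(hex_id) != 24:
        raise ValueError("an ObjectId has 24 hexadecimal characters")
    seconds = int(hex_id[:8], 16)  # first 4 bytes = big-endian Unix timestamp
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

created = objectid_timestamp("5ea3ac4f0fee33e9a1859bd1")
print(created.isoformat())
```

The decoded time falls on 25 April 2020 (UTC), which is consistent with the tweet timestamps shown in this guide.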



**INSERT_MANY:**
 You can also insert many documents at a time into the Collection using the following command:
 collection.insert_many([{document1 contents},{document2 contents}])


**EXERCISE 3:** As in the previous exercise, insert documents into a collection, but this time insert multiple documents at once:

In [15]:
# ENTER CODE HERE
# SAMPLE CODE TO BE REMOVED
posts.insert_many([{'name' : 'Steve' ,                                    # String 
          'age' : 25,                                       # Integer
          'gender' : 'Male',                                 # String 
          'address': {
              'street' : 'Huntington',                             # String 
              'number' : 23,                             # Integer
              'city' : 'Boston'                                 # String 
              }},{'name' : 'Tom' ,                                    # String 
          'age' : 33,                                       # Integer
          'gender' : 'Male',                                 # String 
          'address': {
              'street' : 'Mass ave',                             # String 
              'number' : 44,                             # Integer
              'city' : 'Boston'                                 # String 
              }}])

<pymongo.results.InsertManyResult at 0x2df389eaa00>

## Reading Documents in a Collection
To retrieve data from a collection, we make use of the find() and find_one() functions.

The find_one() method returns a single document from a collection (or None if there are no matches). 
It is useful when there is only one matching document, or when you are only interested in the first match:




In [16]:
posts.find_one()



{'_id': ObjectId('5ea325962a0f212921292cc9'),
 'created_at': datetime.datetime(2020, 4, 24, 17, 43, 38),
 'geo': None,
 'id': 1253741401810010119,
 'retweet_count': 41,
 'source': 'Twitter Web App',
 'text': 'RT @SpirosMargaris: How #AI can tackle the #climate emergency \n\nif developed responsibly\n\nhttps://t.co/glveQ74s9k #fintech @ConversationUK…',
 'in_reply_to_screen_name': None}

To get more than a single document as the result of a query we use the find() method. find() returns a Cursor instance, which allows iteration over all matching documents.



In [17]:
posts.find()



<pymongo.cursor.Cursor at 0x2df389cd430>

For example, we can iterate over the first 5 documents  in the posts collection:

In [18]:
for d in posts.find()[:5]:
    print(d)


{'_id': ObjectId('5ea325962a0f212921292cc9'), 'created_at': datetime.datetime(2020, 4, 24, 17, 43, 38), 'geo': None, 'id': 1253741401810010119, 'retweet_count': 41, 'source': 'Twitter Web App', 'text': 'RT @SpirosMargaris: How #AI can tackle the #climate emergency \n\nif developed responsibly\n\nhttps://t.co/glveQ74s9k #fintech @ConversationUK…', 'in_reply_to_screen_name': None}
{'_id': ObjectId('5ea325962a0f212921292cca'), 'created_at': datetime.datetime(2020, 4, 24, 17, 43, 24), 'geo': None, 'id': 1253741345828745216, 'retweet_count': 8, 'source': 'Twitter for Android', 'text': 'RT @ipfconline1: 10 Groups of #MachineLearning Algorithms \n\nhttps://t.co/dQsSp6ozVa @thomas_glare v/ @datamadesimple\n#AI #DeepLearning\nCc @…', 'in_reply_to_screen_name': None}
{'_id': ObjectId('5ea325962a0f212921292ccb'), 'created_at': datetime.datetime(2020, 4, 24, 17, 43, 17), 'geo': None, 'id': 1253741316011327488, 'retweet_count': 9, 'source': 'Twitter for Android', 'text': 'RT @DeepLearn007: @mercer 

This can also be displayed as a Python List:

In [19]:
list(posts.find())[:5]

[{'_id': ObjectId('5ea325962a0f212921292cc9'),
  'created_at': datetime.datetime(2020, 4, 24, 17, 43, 38),
  'geo': None,
  'id': 1253741401810010119,
  'retweet_count': 41,
  'source': 'Twitter Web App',
  'text': 'RT @SpirosMargaris: How #AI can tackle the #climate emergency \n\nif developed responsibly\n\nhttps://t.co/glveQ74s9k #fintech @ConversationUK…',
  'in_reply_to_screen_name': None},
 {'_id': ObjectId('5ea325962a0f212921292cca'),
  'created_at': datetime.datetime(2020, 4, 24, 17, 43, 24),
  'geo': None,
  'id': 1253741345828745216,
  'retweet_count': 8,
  'source': 'Twitter for Android',
  'text': 'RT @ipfconline1: 10 Groups of #MachineLearning Algorithms \n\nhttps://t.co/dQsSp6ozVa @thomas_glare v/ @datamadesimple\n#AI #DeepLearning\nCc @…',
  'in_reply_to_screen_name': None},
 {'_id': ObjectId('5ea325962a0f212921292ccb'),
  'created_at': datetime.datetime(2020, 4, 24, 17, 43, 17),
  'geo': None,
  'id': 1253741316011327488,
  'retweet_count': 9,
  'source': 'Twitter for An

**Count Documents:**
To **count** the number of documents in a collection, we can use the count_documents() or estimated_document_count() functions.
**NOTE** : the older count() function was deprecated in PyMongo 3.7 and removed in PyMongo 4.0, hence it is recommended to use count_documents(), which requires a filter argument (an empty filter {} matches every document).

In [20]:
posts.count_documents({})


213

## MongoDB Queries:
MongoDB queries are written in a JSON format, similar to the document.
To create a query, you must specify a dictionary with the required attributes.


In [21]:
posts.count_documents({"source": "Twitter for Android"})



39

The search queries can also make use of special query operators. These operators include $gt, $gte, $lt, $lte, $ne, $nin, $regex, $exists, $not, $or and more. For example:


In [22]:
from datetime import datetime
# "%d/%m/%y" reads the string as day/month/two-digit year, so date1 is 20 April 2010
date1 = datetime.strptime("20/04/10 15:15", "%d/%m/%y %H:%M") 
cursor = posts.find({'created_at':{"$gt":date1}})
cursor.next()


{'_id': ObjectId('5ea325962a0f212921292cc9'),
 'created_at': datetime.datetime(2020, 4, 24, 17, 43, 38),
 'geo': None,
 'id': 1253741401810010119,
 'retweet_count': 41,
 'source': 'Twitter Web App',
 'text': 'RT @SpirosMargaris: How #AI can tackle the #climate emergency \n\nif developed responsibly\n\nhttps://t.co/glveQ74s9k #fintech @ConversationUK…',
 'in_reply_to_screen_name': None}
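To build intuition for these operators, here is a toy in-memory matcher in plain Python. It only sketches the semantics of a handful of operators on top-level fields and is not how MongoDB evaluates queries; the `matches` function and the sample documents are made up for this example.

```python
import operator

# Map a few MongoDB-style ordering operators to Python functions.
COMPARISONS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}

def matches(doc, query):
    """Return True if doc satisfies every condition in query (top-level fields only)."""
    for field, condition in query.items():
        value = doc.get(field)
        if not isinstance(condition, dict):
            # A bare value means equality, as in {"source": "Twitter Web App"}.
            if value != condition:
                return False
            continue
        for op, operand in condition.items():
            if op == "$ne":
                if value == operand:
                    return False
            elif op == "$nin":
                if value in operand:
                    return False
            elif op == "$in":
                if value not in operand:
                    return False
            elif op in COMPARISONS:
                # In this sketch a missing field fails ordering comparisons.
                if value is None or not COMPARISONS[op](value, operand):
                    return False
            else:
                raise ValueError(f"operator {op} is not covered by this sketch")
    return True

docs = [
    {"source": "Twitter Web App", "retweet_count": 41},
    {"source": "Twitter for Android", "retweet_count": 8},
    {"source": "Twitter for iPhone", "retweet_count": 47},
]

popular = [d for d in docs if matches(d, {"retweet_count": {"$gte": 10, "$lt": 45}})]
other_sources = [
    d for d in docs
    if matches(d, {"source": {"$nin": ["Twitter for Android", "Twitter Web App"]}})
]
print(len(popular), len(other_sources))
```

Several conditions on the same field, such as `$gte` together with `$lt`, must all hold, which is how range queries are written.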

To count the matching documents instead of retrieving them, pass the same query to count_documents():


In [23]:
posts.count_documents({'created_at':{"$gt":date1}})



177

**EXERCISE 4**
 Find the number of posts which were created on or before "20/04/23 12:15"

In [24]:
#ENTER CODE HERE
#SAMPLE ANSWER TO BE REMOVED:
date2 = datetime.strptime("20/04/23 12:15", "%d/%m/%y %H:%M") 
posts.count_documents({'created_at':{"$lte":date2}})



177

**EXERCISE 5** Find the number of posts which were created between "20/04/20 12:15" and "20/04/24 09:15"


In [25]:
#ENTER CODE HERE
#SAMPLE ANSWER TO BE REMOVED:
date3 = datetime.strptime("20/04/20 12:15", "%d/%m/%y %H:%M") 
date4 = datetime.strptime("20/04/24 09:15", "%d/%m/%y %H:%M") 
posts.count_documents({'created_at':{"$gte": date3, "$lt": date4}})



177

**EXERCISE 6**
 Find the number of posts except the ones with "source: Twitter for Android" and "source: Twitter Web App"

In [26]:
#ENTER CODE HERE
#SAMPLE ANSWER TO BE REMOVED:
posts.count_documents({"source":{"$nin":["Twitter for Android","Twitter Web App"]}})



125

## Sorting:
MongoDB is capable of sorting a query on the server-side in a more efficient way than on the client-side.

For example, let's get the most recent post by sorting the query in descending order and looking at the first result:


In [27]:
lastpost = posts.find().sort([("created_at", pymongo.DESCENDING)]).limit(1)

for t in lastpost: print (t)

{'_id': ObjectId('5ea3ac500fee33e9a1859bd2'), 'created_at': datetime.datetime(2020, 4, 25, 3, 19, 32), 'geo': None, 'id': 1253886332499345420, 'retweet_count': 2, 'source': 'digital transformation', 'text': 'RT @IAM__Network: Early digital transformation efforts pay dividends as business continuity plans hold strong – Which-50 \n\nREAD MORE: https…', 'in_reply_to_screen_name': None}
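For comparison, the same "sort descending, take the first" idea looks like this on the client side in plain Python. The sample documents are invented; with a real collection, the server-side `sort()` and `limit()` calls avoid transferring every document to the client.

```python
from datetime import datetime

docs = [
    {"id": 1, "created_at": datetime(2020, 4, 24, 17, 43, 38)},
    {"id": 2, "created_at": datetime(2020, 4, 25, 3, 19, 32)},
    {"id": 3, "created_at": datetime(2020, 4, 24, 17, 43, 17)},
]

# Client-side equivalent of .sort([("created_at", DESCENDING)]).limit(1):
# order by the key, newest first, then keep one document.
latest = sorted(docs, key=lambda d: d["created_at"], reverse=True)[:1]

print(latest[0]["id"])
```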


## Updating a MongoDB Document
We can update or modify a document in a MongoDB collection in several ways.

First, let's look at the document we created earlier:

In [28]:
posts.find_one({"name":"Ashwin"})



{'_id': ObjectId('5ea3590f59505cd04417c8ab'),
 'name': 'Ashwin',
 'age': '7',
 'gender': 'Male',
 'address': {'street': 'pennsylvania avenue',
  'number': '1600',
  'city': 'Washington DC'}}

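Passing a plain document (one with no update operators) as the second argument replaces the entire matched document. The call below overwrites the first document matching name "Ashwin" with a document that contains only the text field, so it no longer has a name. The find_one() that follows therefore returns a different "Ashwin" document (note the different _id), which appears to come from an earlier run of the insert cell. To change individual fields instead, use update operators such as $set.


In [29]:
# replace_one() is the current equivalent of the deprecated update() with a plain document
posts.replace_one({"name":"Ashwin"},{"text":"HELLO! This is my first update!!"})

# To view the result, let's query by name again:
posts.find_one({"name":"Ashwin"})
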


{'_id': ObjectId('5ea35a1b59505cd04417c8ae'),
 'name': 'Ashwin',
 'age': '7',
 'gender': 'Male',
 'address': {'street': 'pennsylvania avenue',
  'number': '1600',
  'city': 'Washington DC'}}

### Set : Update Operator
The $set operator updates the first document that matches the filter given as the first argument,
adding or replacing the fields listed in the second argument.


In [30]:
posts.update_one({"name":"Ashwin"},{"$set":{"retweets":50,"source":"Twitter from Android"}})
posts.find_one({"name":"Ashwin"})



{'_id': ObjectId('5ea35a1b59505cd04417c8ae'),
 'name': 'Ashwin',
 'age': '7',
 'gender': 'Male',
 'address': {'street': 'pennsylvania avenue',
  'number': '1600',
  'city': 'Washington DC'},
 'retweets': 50,
 'source': 'Twitter from Android'}

### inc : Update Operator
The $inc operator increments the value of a field by the given amount. If the field does not exist in the document, it is created and set to the increment value.

In [31]:

posts.update_one({"name":"Ashwin"},{"$inc":{"retweets":100}})
posts.find_one({"name":"Ashwin"})



{'_id': ObjectId('5ea35a1b59505cd04417c8ae'),
 'name': 'Ashwin',
 'age': '7',
 'gender': 'Male',
 'address': {'street': 'pennsylvania avenue',
  'number': '1600',
  'city': 'Washington DC'},
 'retweets': 150,
 'source': 'Twitter from Android'}

### rename : update operator   
The $rename operator changes the name of a field in a document.

In [32]:
posts.update_one({"name":"Ashwin"},{"$rename":{"name":"first_name"}})
posts.find_one({"first_name":"Ashwin"})



{'_id': ObjectId('5ea35a1b59505cd04417c8ae'),
 'age': '7',
 'gender': 'Male',
 'address': {'street': 'pennsylvania avenue',
  'number': '1600',
  'city': 'Washington DC'},
 'retweets': 150,
 'source': 'Twitter from Android',
 'first_name': 'Ashwin'}

In [33]:
posts.find_one({"first_name":"Ashwin"})



{'_id': ObjectId('5ea35a1b59505cd04417c8ae'),
 'age': '7',
 'gender': 'Male',
 'address': {'street': 'pennsylvania avenue',
  'number': '1600',
  'city': 'Washington DC'},
 'retweets': 150,
 'source': 'Twitter from Android',
 'first_name': 'Ashwin'}

### push : update operator
The $push operator appends a value to an array field.
If the field is not present, the push operation creates a new array field with the specified name, containing that value.
Note: $push only works on array fields.
Example:
import random
posts.update_one({"first_name":"Ashwin"},{"$push":{"score":{"nb":random.randint(-5, 5),"svm":random.randint(-5, 5)}}})
posts.find_one({"first_name":"Ashwin"})

### pop : update operator
The $pop operator removes the first or last element of an array: pass -1 to remove the first element or 1 to remove the last.


posts.update_one({"first_name":"Ashwin"},{"$pop":{"score":1}})

posts.find_one({"first_name":"Ashwin"})


### pull : update operator
The $pull operator removes all instances of a specific value from an array; this is especially useful when working with redundant values.



In [34]:
# First, we add an array field
posts.update_one({"first_name":"Ashwin"},{"$set":{"likes":["nature","music","DMDD","video games"]}})
posts.find_one({"first_name":"Ashwin"})



{'_id': ObjectId('5ea35a1b59505cd04417c8ae'),
 'age': '7',
 'gender': 'Male',
 'address': {'street': 'pennsylvania avenue',
  'number': '1600',
  'city': 'Washington DC'},
 'retweets': 150,
 'source': 'Twitter from Android',
 'first_name': 'Ashwin',
 'likes': ['nature', 'music', 'DMDD', 'video games']}

In [35]:
# Now, we pull DMDD from the array
posts.update_one({"first_name":"Ashwin"},{"$pull":{"likes":"DMDD"}})
posts.find_one({"first_name":"Ashwin"}) 



{'_id': ObjectId('5ea35a1b59505cd04417c8ae'),
 'age': '7',
 'gender': 'Male',
 'address': {'street': 'pennsylvania avenue',
  'number': '1600',
  'city': 'Washington DC'},
 'retweets': 150,
 'source': 'Twitter from Android',
 'first_name': 'Ashwin',
 'likes': ['nature', 'music', 'video games']}
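The update operators covered so far can be summarized with a toy function that applies them to a Python dictionary. This is a sketch of the semantics for top-level fields only, under the assumption that array fields hold plain values; `apply_update` is a hypothetical helper, not a PyMongo API.

```python
import copy

def apply_update(doc, update):
    """Return a copy of doc with a MongoDB-style update document applied."""
    result = copy.deepcopy(doc)
    for op, changes in update.items():
        for field, arg in changes.items():
            if op == "$set":
                # Add the field or overwrite its value.
                result[field] = arg
            elif op == "$inc":
                # A missing field is treated as starting from 0.
                result[field] = result.get(field, 0) + arg
            elif op == "$rename":
                # arg is the new field name.
                if field in result:
                    result[arg] = result.pop(field)
            elif op == "$push":
                # Create the array if it does not exist yet.
                result.setdefault(field, []).append(arg)
            elif op == "$pop":
                # 1 removes the last element, -1 removes the first.
                if result.get(field):
                    result[field].pop(-1 if arg == 1 else 0)
            elif op == "$pull":
                # Remove every occurrence of the value.
                if field in result:
                    result[field] = [v for v in result[field] if v != arg]
            else:
                raise ValueError(f"operator {op} is not covered by this sketch")
    return result

doc = {"name": "Ashwin", "age": "7"}
doc = apply_update(doc, {"$set": {"retweets": 50}})
doc = apply_update(doc, {"$inc": {"retweets": 100}})
doc = apply_update(doc, {"$rename": {"name": "first_name"}})
doc = apply_update(doc, {"$set": {"likes": ["nature", "music", "DMDD", "video games"]}})
doc = apply_update(doc, {"$pull": {"likes": "DMDD"}})
doc = apply_update(doc, {"$pop": {"likes": 1}})
print(doc)
```

Each call returns a new dictionary, so the steps can be inspected one at a time, much like re-running find_one() after each update in the cells above.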

In [36]:
# the update_many() method can be used when there are multiple  documents to be updated at the same time.
posts.update_many( {"first_name":"Ashwin"}, {'$set' : {"first_name":"Not Ashwin"}} )
list (posts.find( ))

[{'_id': ObjectId('5ea325962a0f212921292cc9'),
  'created_at': datetime.datetime(2020, 4, 24, 17, 43, 38),
  'geo': None,
  'id': 1253741401810010119,
  'retweet_count': 41,
  'source': 'Twitter Web App',
  'text': 'RT @SpirosMargaris: How #AI can tackle the #climate emergency \n\nif developed responsibly\n\nhttps://t.co/glveQ74s9k #fintech @ConversationUK…',
  'in_reply_to_screen_name': None},
 {'_id': ObjectId('5ea325962a0f212921292cca'),
  'created_at': datetime.datetime(2020, 4, 24, 17, 43, 24),
  'geo': None,
  'id': 1253741345828745216,
  'retweet_count': 8,
  'source': 'Twitter for Android',
  'text': 'RT @ipfconline1: 10 Groups of #MachineLearning Algorithms \n\nhttps://t.co/dQsSp6ozVa @thomas_glare v/ @datamadesimple\n#AI #DeepLearning\nCc @…',
  'in_reply_to_screen_name': None},
 {'_id': ObjectId('5ea325962a0f212921292ccb'),
  'created_at': datetime.datetime(2020, 4, 24, 17, 43, 17),
  'geo': None,
  'id': 1253741316011327488,
  'retweet_count': 9,
  'source': 'Twitter for An

## Deleting a database, collection or Document:
### Database:
Deleting a database is a relatively simple process and it can be done using the **drop_database()** command.

In [37]:
# first, let's look at the available databases:
conn.list_database_names()



['admin',
 'assignment1db',
 'assignment2db',
 'assignment3',
 'config',
 'gamedb',
 'local',
 'moviedb',
 'mydb',
 'nyse',
 'testdb',
 'twitter_search']

In [38]:
# here, we delete the "test1" database (dropping a database that does not exist has no effect,
# which is why the list below is unchanged):
conn.drop_database("test1")
conn.list_database_names()



['admin',
 'assignment1db',
 'assignment2db',
 'assignment3',
 'config',
 'gamedb',
 'local',
 'moviedb',
 'mydb',
 'nyse',
 'testdb',
 'twitter_search']

### Collection
Deleting a collection is similar to deleting a database and it can be done using the drop_collection() method:

In [39]:
# First we create a database, a collection and a document inside it:
conn.testdb.testcollection.insert_one({"Message":"Dropping a collection"})
conn.testdb.list_collection_names()



['mycol', 'testcollection']

In [40]:
# Now, we drop the collection:

conn.testdb.drop_collection("testcollection")
conn.testdb.list_collection_names()



['mycol']

### Remove a Document :
Unlike a database or collection, a document is deleted using the delete_one() method (the older remove() method is deprecated). Its argument is a filter that identifies the document to delete.
Example:

In [41]:
posts.delete_one({"first_name":"Ashwin"})
posts.find_one({"first_name":"Ashwin"}) 



Similarly, we can remove multiple documents using the delete_many() method.
Example:

In [42]:
delete = posts.delete_many({"Age": 28})    # deletes every document that matches the filter
list (posts.find( ))


[{'_id': ObjectId('5ea325962a0f212921292cc9'),
  'created_at': datetime.datetime(2020, 4, 24, 17, 43, 38),
  'geo': None,
  'id': 1253741401810010119,
  'retweet_count': 41,
  'source': 'Twitter Web App',
  'text': 'RT @SpirosMargaris: How #AI can tackle the #climate emergency \n\nif developed responsibly\n\nhttps://t.co/glveQ74s9k #fintech @ConversationUK…',
  'in_reply_to_screen_name': None},
 {'_id': ObjectId('5ea325962a0f212921292cca'),
  'created_at': datetime.datetime(2020, 4, 24, 17, 43, 24),
  'geo': None,
  'id': 1253741345828745216,
  'retweet_count': 8,
  'source': 'Twitter for Android',
  'text': 'RT @ipfconline1: 10 Groups of #MachineLearning Algorithms \n\nhttps://t.co/dQsSp6ozVa @thomas_glare v/ @datamadesimple\n#AI #DeepLearning\nCc @…',
  'in_reply_to_screen_name': None},
 {'_id': ObjectId('5ea325962a0f212921292ccb'),
  'created_at': datetime.datetime(2020, 4, 24, 17, 43, 17),
  'geo': None,
  'id': 1253741316011327488,
  'retweet_count': 9,
  'source': 'Twitter for An
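
Both delete methods return a result object whose `deleted_count` attribute reports how many documents were removed. The sketch below wraps the two calls in one place; the helper name `delete_documents` and its `many` flag are assumptions made for illustration, and `collection` is assumed to be a PyMongo collection such as `posts`.

```python
def delete_documents(collection, query, many=False):
    """Delete documents matching `query` and return how many were removed."""
    if many:
        # delete_many() removes every document that matches the filter.
        result = collection.delete_many(query)
    else:
        # delete_one() removes at most one matching document.
        result = collection.delete_one(query)
    return result.deleted_count
```

For example, `delete_documents(posts, {"Age": 28}, many=True)` would return the number of documents that had `Age` equal to 28.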

## PRACTICE PROJECT
Use a social media API such as Twitter, Reddit or Instagram to pull data from user posts/comments/uploads on topics related to the coronavirus.
Filter the obtained data to keep only the required fields and insert it into MongoDB.
Perform analyses on the data using different queries to find useful statistics, for example:
 1. Find the cities/districts with the most infections.
 2. Find the dates on which these topics were trending.
 3. Modify the data to include further details.
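
As a starting hint for the filtering step, the sketch below keeps only selected fields from a raw post before it is inserted. The field names mirror the tweet documents shown in this guide; the helper name `keep_required_fields` and the constant `REQUIRED_FIELDS` are hypothetical, and other APIs will use different field names.

```python
# Fields to keep, taken from the tweet documents shown earlier in this guide.
REQUIRED_FIELDS = (
    "created_at", "geo", "id", "retweet_count",
    "source", "text", "in_reply_to_screen_name",
)

def keep_required_fields(raw_post, fields=REQUIRED_FIELDS):
    """Return a new dict with only the wanted fields (missing ones become None)."""
    return {field: raw_post.get(field) for field in fields}
```

The filtered documents could then be stored with a call such as `posts.insert_many([keep_required_fields(p) for p in raw_posts])`, where `raw_posts` is assumed to be a non-empty list of dictionaries returned by the API.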

In [43]:
# begin practice here


### FURTHER EXERCISES
#### EXERCISE 7

Using what you have learned, create a new database to store a collection about superheroes.
Add 5 superhero documents with attributes such as Name, Country, Powers, etc.
Now, use what you've learned to write 4 interesting queries that return JSON objects.

In [44]:
# Enter your code here


#### EXERCISE 8
Use what you have learned to delete only the superheroes whose names start with "P" or a later letter of the alphabet.


In [45]:
# Enter your code here




# MongoDB Aggregation

MongoDB aggregation processes data records: it groups values from multiple documents together and can perform a variety of operations on the grouped data to return a single result.

There are 3 methods by which we can perform **Aggregation:**
#### 1. Using Aggregation Pipeline
#### 2. Using MapReduce Functions
#### 3. Other single purpose aggregation methods.

### Aggregation Pipeline:
MongoDB has modeled its aggregation framework using the concept of data processing pipelines.
In this method, the documents enter a multistage pipeline where they are transformed into an aggregated result.

A pipeline can contain any number of stages. Two commonly used stages are:
1. **$match Stage:** Here, the documents are filtered based on specific criteria, such as attributes or specific values in fields. Only the documents that satisfy the criteria are passed on to the next stage.
2. **$group Stage:** Here, the documents are grouped by specific criteria, and calculations can be performed on each group.

Other pipeline stages provide tools for grouping and sorting documents by specific fields, and for aggregating the contents of arrays, including arrays of documents.
Pipelines can also make use of operators for tasks such as concatenation or calculating the mean.
For example:

In [46]:
# Creating an aggregate function to calculate the total number of retweets of our twitter posts.

db = conn.twitter_search
agr = [ {'$group': {'_id': 1, 'all': { '$sum':'$retweet_count' } } } ]
val = list(db.posts.aggregate(agr))
print('The total number of retweets is {}'.format(val[0]['all']))


The total number of retweets is 2810


The above example calculates the sum of all the retweets of the posts in our MongoDB collection.
Here, the *$sum* operator calculates and returns the sum of numeric values.
The *$group* stage groups the input documents by a specified expression and applies the accumulator expressions to each group. Because the grouping key `_id` is the constant 1, all documents fall into a single group.
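
As a sketch of how a *$match* stage can be placed in front of the *$group* stage used above, the block below builds a two-stage pipeline and mirrors its effect in plain Python so the result can be checked without a server. The threshold of 9 retweets, the function names and the `sample_posts` list are assumptions for illustration; the retweet counts are the three values visible in the output of this guide.

```python
def build_retweet_pipeline(min_retweets):
    """Return a pipeline that sums retweet_count over posts with enough retweets."""
    return [
        # Stage 1: keep only documents whose retweet_count is >= min_retweets.
        {"$match": {"retweet_count": {"$gte": min_retweets}}},
        # Stage 2: put all remaining documents in one group and sum the counts.
        {"$group": {"_id": None, "all": {"$sum": "$retweet_count"}}},
    ]

def total_retweets_in_python(docs, min_retweets):
    """Plain-Python equivalent of the pipeline, for illustration only."""
    matched = [d for d in docs if d["retweet_count"] >= min_retweets]
    return sum(d["retweet_count"] for d in matched)

# Retweet counts of the three sample tweets shown in this guide.
sample_posts = [{"retweet_count": 41}, {"retweet_count": 8}, {"retweet_count": 9}]
print(total_retweets_in_python(sample_posts, 9))  # prints 50
```

Against a live collection, the pipeline would be run as `list(db.posts.aggregate(build_retweet_pipeline(9)))`. If no document passes the *$match* stage, the returned list is empty, so indexing it with `[0]` should be guarded.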


In [47]:
list(db.posts.find())


[{'_id': ObjectId('5ea325962a0f212921292cc9'),
  'created_at': datetime.datetime(2020, 4, 24, 17, 43, 38),
  'geo': None,
  'id': 1253741401810010119,
  'retweet_count': 41,
  'source': 'Twitter Web App',
  'text': 'RT @SpirosMargaris: How #AI can tackle the #climate emergency \n\nif developed responsibly\n\nhttps://t.co/glveQ74s9k #fintech @ConversationUK…',
  'in_reply_to_screen_name': None},
 {'_id': ObjectId('5ea325962a0f212921292cca'),
  'created_at': datetime.datetime(2020, 4, 24, 17, 43, 24),
  'geo': None,
  'id': 1253741345828745216,
  'retweet_count': 8,
  'source': 'Twitter for Android',
  'text': 'RT @ipfconline1: 10 Groups of #MachineLearning Algorithms \n\nhttps://t.co/dQsSp6ozVa @thomas_glare v/ @datamadesimple\n#AI #DeepLearning\nCc @…',
  'in_reply_to_screen_name': None},
 {'_id': ObjectId('5ea325962a0f212921292ccb'),
  'created_at': datetime.datetime(2020, 4, 24, 17, 43, 17),
  'geo': None,
  'id': 1253741316011327488,
  'retweet_count': 9,
  'source': 'Twitter for An

### EXERCISE 9
Create an aggregation pipeline to find the number of posts in each category of "source".

In [48]:
# Enter your code here


![title](MR.jpg "Title")

### MapReduce:
MongoDB also allows aggregation to be performed through MapReduce.
MapReduce consists of two phases:
1. **Map:** This stage processes each input document and emits zero or more key-value pairs for it.
2. **Reduce:** This stage combines the values emitted by the map operation for each key.

There can also be an optional **finalize** function that makes final modifications to the reduced data.
A MapReduce operation can also specify a query condition to select the input documents, as well as sort and limit the results.
Furthermore, MapReduce can operate on and output to a sharded collection.
For example:

In [49]:
# First, let's insert some sample data to work with.
# (Connection was removed from PyMongo; MongoClient is its replacement.)
from pymongo import MongoClient
db = MongoClient().mapreduce_example
db.animals.insert_many([
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
    {"x": 3, "tags": ["rabbit", "cat", "dog"]},
    {"x": 4, "tags": []},
])

Next, we can define our Mapper and Reducer functions.

In [50]:
# Goal: count the number of occurrences of each tag in the tags array, across the collection.

# The map function emits a single (tag, 1) pair for each tag in the array.
# The reduce function sums all of the emitted values for a given key.
# The query option restricts the input to documents that carry the tag "cat".
# Mongo shell version (left commented out because it is JavaScript, not Python):
#db.animals.mapReduce(function(){ this.tags.forEach(function(tag){ emit(tag, 1); }); }, function(key, values){ return Array.sum(values); }, { query:{tags:"cat"}, out:"total" })
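
To make the two phases concrete without a running server, the sketch below imitates them in plain Python on the same sample documents. It is an illustration of the idea, not MongoDB's implementation, and the names `map_tags`, `reduce_counts` and `run_map_reduce` are hypothetical. The final variable shows an aggregation pipeline that should produce the same counts; MongoDB 5.0 deprecates map-reduce in favor of such pipelines.

```python
from collections import defaultdict

# The same sample documents that were inserted into db.animals.
animals = [
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
    {"x": 3, "tags": ["rabbit", "cat", "dog"]},
    {"x": 4, "tags": []},
]

def map_tags(doc):
    """Map phase: emit one (tag, 1) pair per tag; no tags means no pairs."""
    for tag in doc["tags"]:
        yield tag, 1

def reduce_counts(key, values):
    """Reduce phase: combine all values emitted for one key."""
    return sum(values)

def run_map_reduce(docs, mapper, reducer):
    """Group the emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}

tag_counts = run_map_reduce(animals, map_tags, reduce_counts)
print(tag_counts)  # prints {'dog': 2, 'cat': 3, 'rabbit': 1}

# Aggregation pipeline expected to give the same counts on the server:
tag_count_pipeline = [
    {"$unwind": "$tags"},
    {"$group": {"_id": "$tags", "count": {"$sum": 1}}},
]
```

On a live database the pipeline would be run as `list(db.animals.aggregate(tag_count_pipeline))`.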

### EXERCISE 10
Create a new collection "Cars"
Insert several documents using insert_many into the collection with attributes such as Name, Brand, Manufacture_year, Country, Horsepower, Type, Mileage
Create a Mapper function to emit each car that is manufactured in the US.
Write a Reducer function to count the number of cars for each Brand of car.


In [None]:
# Enter Code here



### License:
The code in this tutorial by Ashwin Lakshman is licensed under The Creative Commons Attribution 3.0 License
https://creativecommons.org/licenses/by/3.0/us/ 

Copyright 2019 Ashwin Lakshman

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
