<h1 style='text-align:center;color:Navy'>  Big Data Systems - Laboratory 2  </h1>

# <span style="color:#3665af">MongoDB </span><span style="font-size:15px">(Estimated time: 20 minutes) </span>

<hr>
In this section, we will practice using MongoDB.

## Prerequisites:
You need a working Google Cloud environment. Please refer to the Google Cloud instructions for this setup. 

## Uploading your files to the cloud
We need to upload the file cities.txt to the bucket. In the following procedure we will refer to the bucket name as **bigdatasystem\_1234\_bucket** so you need to replace that with your bucket name.

## Creating a bucket and uploading the file
1. Go to the cloud console.
2. Using the menu select Storage and then select your bucket. If you don't have one, you need to create one.
3. Drag your file to the bucket. 


![console](img/Storage.PNG)


## Configuring a MongoDB Instance.

Google Cloud provides two ways to deploy a MongoDB instance. The first is to create a MongoDB cluster, which is too costly for this lab. The second is to deploy a container, and that is the approach we will use.

To do that, first open the Google Cloud console. Go to the web console and select your project, then click on the console icon in the top right corner.

![console launcher](img/console.png)

__After you open the console, you should see something like this:__
<hr>

![console](img/console2.png)


### Pulling Docker image

To pull the Docker image, run:
<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px orange;">
gcloud docker -- pull launcher.gcr.io/google/mongodb3:latest
</pre>

### Creating the necessary directories
Just run:
<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px orange;">
mkdir -p ~/mongo/data/shard1
mkdir -p ~/mongo/files
</pre>

We need to pull the _cities.txt_ file from the bucket to the console.
<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px orange;">
gsutil cp gs://bigdatasystem_1234_bucket/cities.txt ~/mongo/files/cities.txt
</pre>
**Note:** Remember to use your bucket name


### Running docker
To create a MongoDB instance just run this command. 

<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px orange;">
docker run \
  --name server1 \
  -p 27017:27017 \
  -v ~/mongo/data/shard1:/data/db \
  -v ~/mongo/files:/files \
  -d \
  launcher.gcr.io/google/mongodb3
</pre>

- --name sets the name of the Docker container.
- -p publishes a container port on the host; here host port 27017 maps to container port 27017, the MongoDB default.
- -v mounts a host directory into the container; e.g. ~/mongo/data/shard1 on the host appears as /data/db in the container.
- -d runs the container detached, i.e. in the background.


**Once you run this command, Docker prints the hexadecimal ID of the new container.**
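
The flags listed above can also be assembled programmatically, which makes the role of each one explicit. The following is only a sketch: the helper name `build_docker_run` and its parameters are hypothetical, and it builds the command without running it.

```python
import os
import shlex


def build_docker_run(name, host_port, container_port, volumes, image):
    """Assemble the argument list for a detached `docker run` (hypothetical helper)."""
    cmd = ["docker", "run", "--name", name, "-p", f"{host_port}:{container_port}"]
    for host_dir, container_dir in volumes:
        # -v host_dir:container_dir mounts a host directory inside the container
        cmd += ["-v", f"{os.path.expanduser(host_dir)}:{container_dir}"]
    cmd += ["-d", image]  # -d: run detached, in the background
    return cmd


cmd = build_docker_run(
    name="server1",
    host_port=27017,
    container_port=27017,
    volumes=[("~/mongo/data/shard1", "/data/db"), ("~/mongo/files", "/files")],
    image="launcher.gcr.io/google/mongodb3",
)
print(shlex.join(cmd))  # shlex.join needs Python 3.8 or later
```

The printed line can be pasted into the Cloud Shell; passing `cmd` to `subprocess.run` would be another option if Docker is available where the script runs.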

### Docker Reference:
[Reference](https://docs.docker.com/) and [Cheat Sheet](https://github.com/wsargent/docker-cheat-sheet)

#### To check which Docker containers are currently running
Run:
<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px orange;">
docker ps
</pre>

#### To stop a docker container
Run:
<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px orange;">
docker stop &lt;dockerName&gt;
</pre>
Example:
<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px orange;">
docker stop server1
</pre>

#### To remove (destroy) a docker container
Run:
<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px orange;">
docker rm &lt;dockerName&gt;
</pre>
Example:
<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px orange;">
docker rm server1
</pre>

## Connect to MongoDB

To execute the client run:
<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px orange;">
docker exec -it server1 mongo admin
</pre>

You should get the following output:

![mongo client](img/mongoClient.png)

To end the client just run:
<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px navy;">
quit();
</pre>


- Create the database **mydb** and the collection **cities**:

<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px navy;">
// Switch to mydb; MongoDB creates it on the first write
use mydb
// Create the collection
db.createCollection("cities")
</pre>

- Verify the existence of the database (mydb) and the database collection (cities):

<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px navy;">
show dbs
show collections
</pre>


<span style="color:RED">Paste the output of the previous four commands here. </span>

### Load some data
Quit the client so we can use mongoimport to load data.

<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px navy;">
quit();
</pre>

- First let's check that we have the file in the correct directory. 

<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px orange;">
ls -al ~/mongo/files
</pre>
This should list cities.txt.

- We will use the **mongoimport** tool to load documents from a text file. The syntax is:
<pre style="background-color:#999999;padding:5px;">
mongoimport  --db &lt;database&gt; --collection &lt;collection&gt; --file &lt;filepath/filename&gt;
</pre>

- So, to execute it through docker just run:
<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px orange;">
docker exec -it server1 mongoimport  --db mydb --collection cities --file /files/cities.txt
</pre>
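
By default mongoimport reads JSON input with one document per line, so a malformed line is a common cause of import errors. The sketch below checks such input locally before importing. The helper name `parse_json_lines` and the sample field names (`name`, `state`, `population`) are assumptions for illustration; the real cities.txt may be structured differently.

```python
import json


def parse_json_lines(text):
    """Parse newline-delimited JSON; return (documents, bad_line_numbers)."""
    docs, bad = [], []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            docs.append(json.loads(line))
        except json.JSONDecodeError:
            bad.append(lineno)  # remember which line failed to parse
    return docs, bad


# Hypothetical sample; the third line is deliberately malformed.
sample = (
    '{"name": "Springfield", "state": "XX", "population": 100}\n'
    '{"name": "Shelbyville", "state": "XX", "population": 50}\n'
    'not json\n'
)
docs, bad = parse_json_lines(sample)
print(len(docs), "documents parsed; malformed lines:", bad)
```

To check the real file, read ~/mongo/files/cities.txt into a string and pass it to the same function.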

<span style="color:RED">Paste the output here. </span>

### Question 2:
In English describe the content of the database collection. 

- Test the command 
<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px navy;">
db.cities.find().pretty()
</pre> 

<span style="color:RED">Include a sample of the result and your description of what that command does.</span>

<hr style="border: 3px double navy;" >
Using the [MongoDB Reference](https://docs.mongodb.com/manual/reference/mongo-shell/), or the information from the slides, answer the following queries using the cities collection.

### Query 1:
List all the cities of the State of Colorado.

<span style="color:RED">Place your code and a sample of the result here.</span>

### Query 2:
List the first 10 cities of the State of Colorado.

<span style="color:RED">Place your code and a sample of the result here.</span>

### Query 3:
List the 10 most populous cities of the State of Colorado.

<span style="color:RED">Place your code and a sample of the result here.</span>

### Query 4:
List the 10 most populous cities.

<span style="color:RED">Place your code and a sample of the result here.</span>

<hr style="border: 3px double navy;" >
## Map-Reduce on MongoDB
As we discussed in class, we can run map-reduce jobs on MongoDB.

Let's count the number of cities per state.

<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px navy;">
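db.cities.mapReduce(
                    function()           { emit(this.state, 1); },        // map: emit one (state, 1) pair per city
                    function(key,values) { return Array.sum(values); },   // reduce: add up the 1s for each state
                    { out: "citiesPerState" }                             // store the result in this collection
                   )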
</pre>

That code will generate a new collection instead of displaying the result. 
Use the commands discussed before to list the collections and to get the information from the new collection.
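
To make the data flow of the job above concrete, here is a minimal pure-Python sketch of the same map, group and reduce steps. It is an illustration only, not how MongoDB executes the job: the helper `map_reduce` and the sample documents are hypothetical, and MongoDB may call the reduce function more than once for the same key.

```python
from collections import defaultdict


def map_reduce(documents, map_fn, reduce_fn):
    """Group the (key, value) pairs emitted by map_fn, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical documents; only the "state" field matters for this count.
cities = [
    {"name": "A", "state": "CO"},
    {"name": "B", "state": "CO"},
    {"name": "C", "state": "NY"},
]

cities_per_state = map_reduce(
    cities,
    map_fn=lambda doc: [(doc["state"], 1)],     # plays the role of emit(this.state, 1)
    reduce_fn=lambda key, values: sum(values),  # plays the role of Array.sum(values)
)
print(cities_per_state)
```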

<span style="color:RED">Place your code output (map-reduce) here.</span>
<hr>
<span style="color:RED">Place the list of collections here.</span>
<hr>
<span style="color:RED">Place the content of the new collection here.</span>

### Map-Reduce Query 1:
Generate a collection called **populationPerState** that contains the population of each state.

<span style="color:RED">Place your code, code output and a sample of the result here.</span>

### Map-Reduce Query 2:
Generate a collection called **totalPopulation** that contains the entire population of the USA.

<span style="color:RED">Place your code, code output and a sample of the result here.</span>

> Hint: you can use your previous computed collection.

<span style="text-align:center;font-size:30px;color:#2F632A">
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Stop and delete your Docker container WHEN YOU ARE FINISHED 
</span>
<br><hr style="border: 3px double navy;" >
<br>

# <span style="color:#3665af">REDIS </span><span style="font-size:15px">(Estimated time: 4 hours) </span>
<hr>
The objective of this assignment is to introduce the use of REDIS (an in-memory data store) to collect data from various sources for subsequent data processing. 

For this assignment, we will be using REDIS on Google Cloud and the Twitter Python libraries.

<b><u>Notebook Layout (Table of contents):</u></b>
1. Environment Set-up
   - Deploying a REDIS Cluster
   - Network Configuration
   - Enabling Remote Access
   - Installing Python libraries
2. Getting Familiar with REDIS
3. Retrieving Information From Twitter
   - Creating Twitter Credentials
   - Accessing Twitter
   - Saving Tweets to REDIS
   - Retrieving Tweets from REDIS
4. Load Database From CSV


<div style="font-size:30px;color:#3665af;background-color:#E9E9F5;padding:10px;">1. Environment Set-up </div>

For this assignment we will be using the REDIS deployment on Google Cloud. Please follow the instructions to set up the cluster. You can opt to install REDIS on your own system. We don't recommend this approach though.<br><br>

<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">1.1. Deploying a REDIS Cluster </div>

Go to the menu and select **Cloud Launcher**. Then use the search box to search for REDIS.

<img src="img/redis_launcher.png" style="width:1000px;">

We will deploy a REDIS cluster using three small-cpu nodes:
- **Name**: redis-1
- **Zone**: us-central1-f
- **Instance Count**: 3
- **Machine Type**: small (1.7 GB memory)
- **Boot Disk size**: 50GB

All other parameters are set to default values. The cost of this deployment is the same as that of any other VM created on the cloud. Your deployment configuration should be similar to this:<br>

<img src="img/redis_deployment.png" style="width:1000px;">



<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">1.2. Network Configuration </div>
Once your cluster is deployed, we need to set up a firewall rule that allows access from our Jupyter Notebook. To do that, go to the menu and select **VPC Network**. Then select Firewall Rules and create a new rule as depicted below:


![terminal](img/network.png)




<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">
    1.3. Enabling Remote Access to REDIS
</div>

After creating the firewall rule, we also need to configure REDIS to allow access from the external network. Follow this procedure to accomplish that:

- Open an SSH terminal to the main server by selecting SSH on the _redis-1-db-vm-0_ VM (under the Compute Engine menu in the Cloud console).

<img src="img/redis_vm.png" style="width:550px;">
 
- Launch the REDIS Client on the console 

<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px orange;">
redis-cli
</pre>

- Within the client change the config to disable protected mode

<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px navy;">
CONFIG SET protected-mode no
</pre>
You should get an **OK** message. Then type:
<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px navy;">
quit
</pre>

Once this is completed, we should be able to access REDIS through the public IP address shown in the VM list.

<img src="img/redis_vm_ip.png" style="width:550px;">


<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">
    1.4. Installing Python libraries
</div>

You need to install two Python libraries for this assignment: 
- the REDIS library (redis), and
- the Twitter library (tweepy).

On the machine where you run the Jupyter Notebook, open a new terminal (use the Anaconda terminal) and run:

<pre style="background-color: #ebece4;padding: 10px;border-left: solid 4px orange;">
pip install tweepy
pip install redis
</pre>




<div style="font-size:30px;color:#3665af;background-color:#E9E9F5;padding:10px;">
    2. Getting Familiar with <b>REDIS</b> 
</div>

You can find the REDIS commands [here](https://redis.io/commands).

In [None]:
##Import the library
import redis

In [None]:
## Connect to the server
REDIS_SERVER = 'Your REDIS IP address'
REDIS_PORT   = 6379
myRedis = redis.StrictRedis(host=REDIS_SERVER, port=REDIS_PORT, db=0)

In [None]:
## Drop everything currently stored on REDIS
display(myRedis.flushdb())   # clears the current database (db 0)
display(myRedis.flushall())  # clears every database on the server

In [None]:
print("I'm storing a value on the key 'myKey'")
display(myRedis.set('myKey', 'This is the key value'))
print("I'm reading the value of 'myKey' from REDIS")
display(myRedis.get('myKey'))

In [None]:
print("I'm storing a List on REDIS")

print("Adding elements to the end of the list")
display(myRedis.rpush('weekdays','Tuesday'))
display(myRedis.rpush('weekdays','Wednesday'))
display(myRedis.rpush('weekdays','Thursday'))
display(myRedis.rpush('weekdays','Friday'))
print("Current List Length:", myRedis.llen('weekdays'))
display("Current Weekdays Content:",myRedis.lrange('weekdays',0,-1))


print("Adding elements to the beginning of the list")
display(myRedis.lpush('weekdays','Monday'))

print("Current List Length:", myRedis.llen('weekdays'))
display("Current Weekdays Content:",myRedis.lrange('weekdays',0,-1))
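
To see why Monday ends up at the front of the list, the push semantics can be mimicked locally with Python's `collections.deque`. This is only an analogy and does not talk to REDIS: `append` stands in for RPUSH and `appendleft` for LPUSH.

```python
from collections import deque

weekdays = deque()
for day in ["Tuesday", "Wednesday", "Thursday", "Friday"]:
    weekdays.append(day)       # like RPUSH: add at the tail of the list
weekdays.appendleft("Monday")  # like LPUSH: add at the head of the list

print(len(weekdays))   # analogous to LLEN weekdays
print(list(weekdays))  # analogous to LRANGE weekdays 0 -1
```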


In [None]:
print("I'm storing a HASH on REDIS")
print("- Remember that hashes can be used to store documents!")

#create a dictionary
user = {"Name"    :"myName", 
        "Company" :"myCompany", 
        "Address" :"myAddress", 
        "Location":"MyLocation"}

print ("Store to REDIS")
display(myRedis.hmset("userDictionary", user))

print ("Retrieve from REDIS")
display(myRedis.hgetall("userDictionary"))
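
With the Python REDIS client on Python 3, replies such as the one from `hgetall` come back as bytes unless the client was created with `decode_responses=True`. The sketch below converts such a reply to ordinary strings; the helper `decode_hash` is hypothetical and the sample reply is hand-written, so no server is needed to run it.

```python
def decode_hash(raw, encoding="utf-8"):
    """Convert the bytes keys and values of a hash reply to str."""
    return {
        (k.decode(encoding) if isinstance(k, bytes) else k):
        (v.decode(encoding) if isinstance(v, bytes) else v)
        for k, v in raw.items()
    }


# Shape of an hgetall reply when decode_responses is left at its default (False).
raw_reply = {b"Name": b"myName", b"Company": b"myCompany"}
print(decode_hash(raw_reply))
```

Passing `decode_responses=True` when constructing the client is an alternative that avoids the manual step.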


<div style="font-size:30px;color:#3665af;background-color:#E9E9F5;padding:10px;">
    3. Retrieving Information From Twitter</div>
<br>
<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">
    3.1. Creating Twitter Credentials
</div>

To access tweets from our application, we need a Twitter account, consumer keys and access tokens.

To generate these, go to https://apps.twitter.com and **Create a New App**. Fill in the form and agree to the terms.

Once that's done, select your app and open the tab **Keys and Access Tokens**.

<img src="img/twitter.PNG" style="width:550px;">


<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">
    3.2. Accessing Twitter
</div>


In [None]:
import tweepy
from tweepy import OAuthHandler
 
consumer_key    = 'PLACE_YOUR_KEYS'
consumer_secret = 'PLACE_YOUR_KEYS'
access_token    = 'PLACE_YOUR_KEYS'
access_secret   = 'PLACE_YOUR_KEYS'
 
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
 
api = tweepy.API(auth)

In [None]:
for tweet in tweepy.Cursor(api.home_timeline).items(2):
    # Process a single tweet

    print(tweet._json.keys())
    print()
    print(tweet._json["id"])
    print(tweet._json["text"])
    print(tweet._json["source"])
    print(tweet._json["lang"])
    print(tweet._json["retweeted"])    
    print(tweet._json["retweet_count"])
    print(tweet._json["favorite_count"])
    print()
    

<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">
    3.3. Saving Tweets to REDIS
</div>


In [None]:
# Let's save some tweets

howManyTweets = 20

for tweet in tweepy.Cursor(api.home_timeline).items(howManyTweets):
    # Process a single tweet
    
    ##Formatting the tweet 
    redisTweet = {
                  "text"           :tweet._json["text"].encode('utf-8'), 
                  "source"         :tweet._json["source"].encode('utf-8'), 
                  "lang"           :tweet._json["lang"].encode('utf-8'), 
                  "retweet_count"  :tweet._json["retweet_count"], 
                  "favorite_count" :tweet._json["favorite_count"]
                 }

    ## Saving the tweet as HASH
    myRedis.hmset(tweet._json["id"], redisTweet)
    #display(tweet._json["id"])
    
    ## Adding the Tweet id to the list of tweets
    myRedis.rpush("tweets",str(tweet._json["id"]))
       
print("Done!")
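
The formatting step above can be pulled into a small function so it can be tried without calling Twitter. The sketch below is an assumption-laden illustration: the helper `tweet_to_hash` and the sample payload are hypothetical. It also replaces `None` with an empty string, because recent versions of the Python REDIS client reject `None` values.

```python
def tweet_to_hash(tweet_json,
                  fields=("text", "source", "lang", "retweet_count", "favorite_count")):
    """Pick the fields to store; missing or None values become empty strings."""
    record = {}
    for field in fields:
        value = tweet_json.get(field)
        record[field] = "" if value is None else value
    return record


# Hypothetical tweet payload, trimmed to the fields used in this lab.
fake_tweet = {
    "id": 1,
    "text": "hello",
    "source": "web",
    "lang": None,
    "retweet_count": 0,
    "favorite_count": 2,
    "retweeted": False,
}
print(tweet_to_hash(fake_tweet))
```

The returned dictionary can be stored as a hash in the same way as `redisTweet` above.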

<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">
    3.4. Retrieving Tweets from REDIS
</div>


In [None]:
# Read up to the first 100 tweet ids, then fetch each tweet's hash
for tweet_id in myRedis.lrange("tweets",0,99):
    print()
    print("Displaying Tweet with ID:",tweet_id)
    print("Text:",myRedis.hmget(tweet_id,"text"))
    print("ALL DATA:",myRedis.hgetall(tweet_id))
    print("========================================================================")

<div style="font-size:30px;color:#3665af;background-color:#E9E9F5;padding:10px;">
    4. Load Database From CSV
</div>

Using all you have learned so far, load the dataset from the CITES Wildlife Trade Database competition available [here](https://www.kaggle.com/cites/cites-wildlife-trade-database/data) into REDIS using a document structure, in a fashion similar to the one used for the tweets.

Using the data loaded into REDIS, compute the number of animals per class for each importer.

Your output should be similar to this:

US . Carnivora . XXX
US . Aves . XXX
...

Measure the running time, and present the average running time for the load and the processing operations.
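
One possible way to obtain an average running time is to repeat an operation several times and average the wall-clock durations. The sketch below assumes nothing about your solution: `average_runtime` and `placeholder_step` are hypothetical names, and the placeholder workload should be replaced by your own load or processing function.

```python
import time
from statistics import mean


def average_runtime(func, repeats=5):
    """Call func() `repeats` times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return mean(timings)


def placeholder_step():
    # Stand-in workload; replace with your own load or processing function.
    sum(range(100_000))


avg = average_runtime(placeholder_step, repeats=3)
print(f"average: {avg:.6f} s")
```

When timing the load step, remember to clear the database between repetitions so that each run starts from the same state.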

**Please also explain your code**


PLACE YOUR ANSWERS/CODE IN CELLS BELOW


<div style="font-size:20px;background-color:#BE6D00;color:#F6EFE5;padding:10px;text-align:center;">
STOP YOUR CLUSTER WHEN YOU ARE NOT WORKING<br><br>
ONCE YOU ARE FINISHED, DELETE YOUR CLUSTER
</div>
<hr style="border: 3px double navy;" >
<br>