<h1 style='text-align:center;color:Navy'>  Big Data Systems - Laboratory 1  </h1>

Topics included in this lab:
 1) Map-Reduce (20 minutes)
 2) Semistructured DS: Spanner DB (20 minutes)
 3) Bigtable (20 minutes)

We have allocated 15 minutes for setup and to address any other issues. 


This is the first Lab for the Big Data Systems course - Spring 2018.<br>
We are covering the following topics:
<ul>
<li>Map-Reduce,</li>
<li>Spanner, and</li>
<li>Bigtable.</li>
</ul>
<br>
The lab will have an in-class section as well as a homework section. <br>
You need to submit your in-class notebook before the end of the class. For the homework section, please refer to Canvas for the due date.

# <span style="color:#3665af">Section #3: Bigtable </span><span style="font-size:15px">(Estimated time: 20 minutes) </span>

<hr>
In this section we will practice how to use Google's Bigtable database. 
## Pre-reqs:
Your Google Cloud account should be ready to deploy services.<br>

### Create a Bigtable Instance

- Create a development Bigtable instance and note its instance ID. 
- This lab uses the instance ID `lab1-section3`; if you choose a different ID, change it in the connection settings below. 
- For this lab, use a development instance: it has only one node but is much cheaper. 


### Installing the Python Client
We need the Google Cloud client library installed on our system.<br>
If you are using Windows or macOS, open Anaconda Navigator and go to Environments. Click the arrow next to _base (root)_ and select Open Terminal. <br>
Then run the following pip commands to install the client libraries. 

<pre>
$> pip install google-cloud 
$> pip install google-cloud-happybase
</pre>

### Getting the Service Account File
As mentioned in the cloud setup instructions, you need to generate a token file (the service account key, in JSON format) so that you can connect to Bigtable. Refer to that document for help, especially if you did not complete Lab 1, Section #2.
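
Before going further, it can save time to confirm that the downloaded key file is readable and looks like a service account key. The sketch below is only a sanity check under stated assumptions: `check_service_key` and `REQUIRED_FIELDS` are hypothetical names introduced here, and the check only looks at a few fields that service account key files normally contain.

```python
import json

# Fields that a service account key file is expected to contain.
# This is a partial list chosen for a quick sanity check.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_key(path):
    """Return the project id stored in the key file at `path`.

    Hypothetical helper: raises ValueError when the file does not
    look like a service account key.
    """
    with open(path, encoding="utf-8") as handle:
        key = json.load(handle)

    missing = [field for field in REQUIRED_FIELDS if field not in key]
    if missing:
        raise ValueError("key file is missing: {}".format(", ".join(missing)))
    if key["type"] != "service_account":
        raise ValueError("unexpected key type: {}".format(key["type"]))
    return key["project_id"]

# Usage in this notebook (after JSON_SERVICE_KEY is set):
#   print(check_service_key(JSON_SERVICE_KEY))
```

The returned project id should match the `project_id` you use when creating the client.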

Documentation [here](http://google-cloud-python.readthedocs.io/en/latest/bigtable/usage.html)

## Hands on...

In [None]:
JSON_SERVICE_KEY = 'JSON SERVICE KEY PATH AND FILENAME'

You can learn more about creating instances and using Spanner with Python at [Google Documentation](https://cloud.google.com/spanner/docs/getting-started/python/)


In [None]:
def explicit():
    ## Check the service account credentials by listing Cloud Storage buckets
    from google.cloud import storage

    # Explicitly use service account credentials by specifying the private key
    # file.
    storage_client = storage.Client.from_service_account_json(JSON_SERVICE_KEY)

    # Make an authenticated API request
    buckets = list(storage_client.list_buckets())
    print(buckets)

You can get help about authentication at [Google Documentation](https://cloud.google.com/docs/authentication/production#auth-cloud-explicit-python)
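
The `bigtable.Client(...)` calls in the rest of this notebook do not receive the key file path explicitly. One common way to let Google client libraries find the key is the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. The sketch below sets it for the current Python process only; `use_service_key` is a hypothetical helper name, not part of any library.

```python
import os


def use_service_key(path):
    """Point Application Default Credentials at the key file at `path`.

    Only affects the current process (for example, this notebook kernel).
    Returns the absolute path that was stored.
    """
    absolute = os.path.abspath(path)
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = absolute
    return absolute

# Usage in this notebook, before creating any client:
#   use_service_key(JSON_SERVICE_KEY)
```

Run this before the cells that create a client, since the variable is read when the client is constructed.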


## Warm up
<hr>

In [None]:
project_id  = "bigdatasystems-spring2018"
instance_id = "lab1-section3"

## <span style="color:#5DB664">Using Bigtable Client</span>

In [None]:
# Imports the Google Cloud Client Library.
from google.cloud import bigtable

# The client must be created with admin=True because it will create a table.
client     = bigtable.Client(project=project_id, admin=True)
instance   = client.instance(instance_id)


If you didn't receive an error, we are now connected to our Bigtable database.<br>

### Let's create a table

In [None]:
table_name  = "greetings"
print('Creating the {} table.'.format(table_name))

table = instance.table(table_name)
table.create()

column_family_name = 'cf1'
cf1 = table.column_family(column_family_name)
cf1.create()

#table.delete()  #to delete the table.

print("done!")
### WARNING
#
## The first time you run this, you will get an error saying that the Admin API is not enabled. 
## The error includes a link. Follow it, enable the API, and retry.
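
Running the table creation cell a second time fails because the table already exists. A minimal sketch of a guard is shown below, assuming the `table` object offers `exists()`, `create()` and `column_family(name)` as the Bigtable client's table object does; `create_table_if_missing` is a hypothetical helper, so check the behavior against the client version you have installed.

```python
def create_table_if_missing(table, family_names):
    """Create `table` and its column families only when the table is absent.

    Returns True when the table was created and False when it was
    already there. Column families of an existing table are not touched.
    """
    if table.exists():
        return False
    table.create()
    for name in family_names:
        table.column_family(name).create()
    return True

# Usage in this notebook:
#   create_table_if_missing(instance.table("greetings"), ["cf1"])
```

This keeps the cell safe to re-run while you are experimenting.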

### Inserting data

In [None]:
print('Writing some greetings to the table.')
column_name = 'greeting'.encode('utf-8')
greetings = [
    'Hello World!',
    'Hello Cloud Bigtable!',
    'Hello Bigtable with Python!',
]

for i, value in enumerate(greetings):
    row_key = 'greeting{}'.format(i)
    row = table.row(row_key)
    row.set_cell(   
                    column_family_name,
                    column_name,
                    value.encode('utf-8')
                )
    row.commit()
    

print('done!')

### Reading data

In [None]:
print('Getting a single greeting by row key.')
key = 'greeting0'

row = table.read_row(key.encode('utf-8'))

value = row.cells[column_family_name][column_name][0].value

print('\t{}: {}'.format(key, value.decode('utf-8')))

In [None]:
print('Scanning for all greetings:')
partial_rows = table.read_rows()
partial_rows.consume_all()

for row_key, row in partial_rows.rows.items():
    key = row_key.decode('utf-8')
    cell = row.cells[column_family_name][column_name][0]
    value = cell.value.decode('utf-8')
    print('\t{}: {}'.format(key, value))
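
The reads above index `row.cells` by column family, then by column name (as bytes), then by position in a list of cells. The sketch below flattens that nested structure into a plain dictionary, which can be easier to inspect. `row_to_dict` is a hypothetical helper; like the cells above, it takes the first cell of each column and assumes the values are UTF-8 text.

```python
def row_to_dict(row):
    """Flatten `row.cells` into {"family:column": value_as_text}.

    Expects `row.cells` to map family name -> {column name (bytes) ->
    list of cells}, where each cell has a `.value` attribute (bytes).
    """
    flat = {}
    for family, columns in row.cells.items():
        for qualifier, cells in columns.items():
            name = "{}:{}".format(family, qualifier.decode("utf-8"))
            flat[name] = cells[0].value.decode("utf-8")
    return flat

# Usage in this notebook:
#   print(row_to_dict(table.read_row(b"greeting0")))
```

The `family:column` naming mirrors the column names used with the HappyBase client.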

<hr>

## <span style="color:#5DB664">Using Happybase Client</span>

In [None]:
# Imports the Google Cloud Client Library.
from google.cloud import happybase

# The client must be created with admin=True because it will create a table.

client     = bigtable.Client(project=project_id, admin=True)
instance   = client.instance(instance_id)
connection = happybase.Connection(instance=instance)

If you didn't receive an error, we are now connected to our Bigtable database.<br>

### Let's create a table

In [None]:
table_name  = "greetings2"
print('Creating the {} table.'.format(table_name))
column_family_name = 'cf1'
connection.create_table(    table_name,
                            {
                                column_family_name: dict()     # Use default options.
                            }
                       )
### WARNING
#
## The first time you run this, you will get an error saying that the Admin API is not enabled. 
## The error includes a link. Follow it, enable the API, and retry.

### Inserting data

In [None]:

print('Writing some greetings to the table.')
table = connection.table(table_name)
column_name = '{fam}:greeting'.format(fam=column_family_name)
greetings = [
    'Hello World!',
    'Hello Cloud Bigtable!',
    'Hello HappyBase!',
]

for i, value in enumerate(greetings):
    row_key = 'greeting{}'.format(i)
    table.put(row_key, {column_name: value})

print('done!')

### Reading data

In [None]:
print('Getting a single greeting by row key.')
key = 'greeting0'.encode('utf-8')
row = table.row(key)
print('\t{}: {}'.format(key, row[column_name.encode('utf-8')]))


In [None]:
print('Scanning for all greetings:')

for key, row in table.scan():
    print('\t{}: {}'.format(key, row[column_name.encode('utf-8')]))

#     print('Deleting the {} table.'.format(table_name))
#     connection.delete_table(table_name)


In [None]:
connection.close()

<hr>

## Let's load some interesting data
<hr>
**We will use the bigtable client but it can be done using happybase too.**

In [None]:
import pandas as pd
import datetime
from time import time
import math

In [None]:
df_Measures        = pd.read_csv("/path/to/measurements.sample.csv")

In [None]:
table_name  = "RadiationMeasurements"
table_columns = [
                    ("time","Captured Time"),("time","Uploaded Time"),
                    ("location","Latitude"),("location","Longitude"),("location","Height"),
                    ("measure","Value"),("measure","Unit"),
                    ("device","Device ID")
                ]

print('Creating the {} table.'.format(table_name))

RadiationMeasurements = instance.table(table_name)
RadiationMeasurements.create()


columnFamilies = []
for aColumn in table_columns:
    columnFamilies.append(aColumn[0])
columnFamilies = list(set(columnFamilies))

for aColumnFamily in columnFamilies:
    # Create each family on the new table, not on the earlier `table` variable.
    cf = RadiationMeasurements.column_family(aColumnFamily)
    cf.create()
     

print("done!")
### WARNING
#
## The first time you run this, you will get an error saying that the Admin API is not enabled. 
## The error includes a link. Follow it, enable the API, and retry.

In [None]:
#RadiationMeasurements.delete()   #UNCOMMENT IF YOU NEED TO DROP THE TABLE. 

In [None]:
for index,dfRow in df_Measures.iterrows():
    row_key = 'measurement_{}'.format(index)
    row = RadiationMeasurements.row(row_key)
    

    for aColumn in table_columns:  #([0],[1]) maps to (columnFamily,columnName)
        row.set_cell(   
                        aColumn[0],
                        aColumn[1],
                        str(dfRow[aColumn[1]]).encode('utf-8')
                    )
    row.commit()

print('done!')

In [None]:
print('Scanning 5 Measurements:')
partial_rows = RadiationMeasurements.read_rows(limit=5)
partial_rows.consume_all()

for row_key, row in partial_rows.rows.items():
    key   = row_key.decode('utf-8')
    rowArr = []
    for aColumn in table_columns:
        rowArr.append(row.cells[aColumn[0]][aColumn[1].encode("utf-8")][0].value)
    print("Key:",key)
    for i in range(len(table_columns)):
        print("      Data:",table_columns[i][0],table_columns[i][1],":",rowArr[i])

In [None]:
print('Scanning measurement 3 to 5:')
partial_rows = RadiationMeasurements.read_rows(start_key="measurement_3",end_key="measurement_5")
partial_rows.consume_all()

for row_key, row in partial_rows.rows.items():
    key   = row_key.decode('utf-8')
    rowArr = []
    for aColumn in table_columns:
        rowArr.append(row.cells[aColumn[0]][aColumn[1].encode("utf-8")][0].value)
    print("Key:",key)
    for i in range(len(table_columns)):
        print("      Data:",table_columns[i][0],table_columns[i][1],":",rowArr[i])

### QUESTION:
**_On which line is the data actually fetched from Bigtable?_** Explain briefly. 

#### <span style="color:red"> --- Answer HERE --- </span>

### QUESTION:
**_What is the difference between the previous two blocks of code?_** Explain briefly.

#### <span style="color:red"> --- Answer HERE --- </span>

In [None]:
print('Sum all measurements whose unit is cpm:')

###############
####    TO DO HERE
##############

print("The sum is:", totalSum)

# <span style="color:#5DB664">DELETE YOUR BIGTABLE INSTANCE WHEN YOU FINISH </span>

<hr><BR>

# <span style="color:RED">Homework: </span> NYC Buildings Dataset
<hr>
As in the example above, we will use the New York City - Buildings competition dataset available [here](https://www.kaggle.com/new-york-city/nyc-buildings/data). <br>
You should:
1. Download the dataset to your machine.
2. Create a new database.
3. Create the necessary tables to load the Brooklyn subset.
4. Load the Brooklyn dataset, but be smart when uploading. If we don't have a value for a particular cell, don't load it into Bigtable.
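
For step 4, one possible way to skip missing values is sketched below. It assumes each CSV row has been turned into a dictionary of column name to value (for example with `dfRow.to_dict()`), and that missing values show up as `None`, `NaN`, or blank strings. `non_empty_cells` is a hypothetical helper, and the definition of "missing" is an assumption you may need to adapt to the dataset.

```python
import math


def non_empty_cells(record):
    """Yield (column, encoded_value) pairs, skipping missing values.

    A value counts as missing when it is None, a float NaN,
    or a string that is empty after stripping whitespace.
    """
    for column, value in record.items():
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        text = str(value).strip()
        if not text:
            continue
        yield column, text.encode("utf-8")

# Usage sketch: only call row.set_cell(...) for the pairs this yields.
```

Because Bigtable rows are sparse, a cell that is never written takes no space, which is why skipping missing values pays off.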


## Questions:
1. Report the time for loading the dataset.
2. Generate a report that for each zipcode displays the average of the lot and building front area. <br>
In this query, performance matters because you are reading a noticeable amount of data. Come up with an initial procedure to solve the query, then try to improve that code. Did your second implementation improve the running time? Explain why. Present the code used in both stages and the runtime of each one. 
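
To report load and query times consistently, a small timing wrapper can help. The sketch below uses only the standard library; `timed` is a hypothetical helper name, and wall-clock time measured this way includes network latency to Bigtable.

```python
import time


def timed(func, *args, **kwargs):
    """Call `func` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Usage sketch, where load_dataset is a function you write:
#   _, seconds = timed(load_dataset)
#   print("Load took {:.2f} s".format(seconds))
```

Timing both implementations with the same wrapper keeps the comparison fair.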



In [None]:
## PLACE YOUR CODE STARTING AT THIS POINT. 

<hr style="border: 3px double navy;" ><br><br><br>