# Zero to API

Table of Contents  
==

[Introduction](#Introduction)  
[DynamoDB](#DynamoDB)  
[Lambda](#Lambda)  


# Introduction

This notebook is a guide to building a serverless API using AWS Lambda. The entire pipeline, from data curation to data retrieval, is built with AWS products: an S3 bucket, DynamoDB, and Lambda.
![Zero to API](https://i.imgur.com/qiCcCNL.jpg)


# DynamoDB

Amazon DynamoDB is a non-relational database. Each item is read back in the form of a dictionary.
Creating a DynamoDB table involves choosing two key attributes:

1) Partition Key - in our case it is 'indiv'
2) Sort Key (optional) - in our case it is 'time'

The sort key is optional, but it can improve query time if the column is likely to be queried through the API.

A DynamoDB query can only place its key conditions on the partition key and the sort key, which means you cannot query on a column that is neither of the two.

For example, our data has the columns ['time', 'x', 'y', 'indiv']. Filtering and subsetting can only be done on our partition key ('indiv') and sort key ('time'), not on 'x' or 'y', so a query to fetch all rows where 'x' == 728.2 won't work.
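As a rough sketch of what a key-only query looks like, the helper below builds the keyword arguments for a query on 'indiv' and 'time'. The function name `build_key_query` is hypothetical and not part of the original notebook; the resulting dict is meant to be passed to boto3's `Table.query(**kwargs)`. The `#t` alias for the `time` attribute is a precaution against clashes with DynamoDB reserved words.

```python
def build_key_query(indiv, start, end):
    """Build kwargs for a DynamoDB query restricted to the table's keys.

    Hypothetical helper for illustration; pass the result to
    boto3's Table.query(**kwargs).
    """
    return {
        # Only the partition key ('indiv') and sort key ('time') may appear here.
        "KeyConditionExpression": "indiv = :indiv AND #t BETWEEN :start AND :end",
        # '#t' is an alias for the 'time' attribute name.
        "ExpressionAttributeNames": {"#t": "time"},
        "ExpressionAttributeValues": {
            ":indiv": str(indiv),  # the upload stores 'indiv' as a string
            ":start": start,
            ":end": end,
        },
    }


kwargs = build_key_query(1, "0:02:52", "0:02:58")
print(kwargs["KeyConditionExpression"])
```

A condition on 'x' or 'y' has no place in `KeyConditionExpression`, which is why those columns cannot drive a query on this table.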

DynamoDB sets read and write capacity limits, which both default to 5 units. These can be increased or decreased based on the API's requirements.


#### Uploading Data from S3 to DynamoDB

After creating a [DynamoDB](https://aws.amazon.com/dynamodb/) table using the GUI on AWS, the next step is to read the CSV from the S3 bucket into the newly created table.

The default write limit (5) is too low for transferring 1 million rows, so for the upload we'll increase it to 1000 in the DynamoDB GUI. ![DynamoDB](https://i.imgur.com/EzC3t8R.png)

We have created a table called 'baboons' with Partition Key = 'indiv' and Sort Key = 'time'.

Once the transfer is done, the write capacity units should be set back to a lower value (default = 5) to reduce cost.
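To get a feel for why the write limit matters, here is a back-of-the-envelope estimate. It assumes each row fits within 1 KB, so that one write consumes one write capacity unit per second, and it ignores burst capacity and network overhead. The function name `estimated_upload_seconds` is made up for this sketch.

```python
import math


def estimated_upload_seconds(n_items, write_capacity_units, item_kb=1.0):
    """Rough time to write n_items when throughput is capped by write capacity.

    Assumption: one write capacity unit allows one write per second
    for an item of up to 1 KB; larger items consume more units.
    """
    units_per_item = max(1, math.ceil(item_kb))
    return n_items * units_per_item / write_capacity_units


rows = 1_000_000
for wcu in (5, 1000):
    minutes = estimated_upload_seconds(rows, wcu) / 60
    print(f"{wcu} write capacity units: roughly {minutes:.0f} minutes")
```

Under these assumptions, the default of 5 units would take on the order of days for a million rows, while 1000 units brings the estimate down to well under an hour.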

Since we have a million rows to process, we will split our data file into smaller chunks (10 in this example) and upload them in parallel to make the upload faster.

The run time for this would be around 15 minutes; without multiprocessing it would be around 150 minutes.
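The chunking step can be illustrated without pandas. The sketch below splits a plain list into near-equal contiguous parts, using the same sizing rule as `np.array_split` (the first `len(rows) % n_chunks` parts get one extra element). The name `split_into_chunks` is hypothetical and only used here.

```python
def split_into_chunks(rows, n_chunks):
    """Split a list into n_chunks contiguous, near-equal parts."""
    base, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks absorb the remainder, one element each.
        size = base + (1 if i < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks


chunks = split_into_chunks(list(range(23)), 10)
print([len(c) for c in chunks])
```

Each chunk can then be handed to a separate worker process, so the chunks upload side by side instead of one after another.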

In [None]:
import pandas as pd
import boto3  # AWS SDK for Python, used here to write to DynamoDB
from tqdm import tqdm  # progress bar
import numpy as np
from multiprocessing import Pool  # parallelise the data upload

def dynamodb_upload(data_frame):
    """Write one chunk of the dataset to the 'baboons' DynamoDB table."""
    data_frame = data_frame.copy()  # work on a copy so the caller's chunk is left untouched
    # store these columns as strings (boto3 does not accept Python floats in items)
    for col in ['x', 'y', 'indiv']:
        data_frame[col] = data_frame[col].astype(str)
    mydata = data_frame.to_dict('records')  # one dict per row for DynamoDB to consume
    MY_ACCESS_KEY_ID = ''
    MY_SECRET_ACCESS_KEY = ''
    resource = boto3.resource('dynamodb', aws_access_key_id=MY_ACCESS_KEY_ID, aws_secret_access_key=MY_SECRET_ACCESS_KEY, region_name='us-east-1')
    table = resource.Table('baboons')  # the table you want to push the data to
    with table.batch_writer() as batch:  # batch writer buffers and sends the puts in batches
        for row in tqdm(mydata):  # progress bar
            batch.put_item(Item=row)

if __name__ == '__main__':
    data = pd.read_csv('https://s3-us-west-2.amazonaws.com/himatdata/baboons/baboons_ritxy.csv')  # S3 location of the baboons dataset
    split_df = np.array_split(data, 10)  # split the data into 10 smaller data frames
    with Pool(processes=10) as pool:  # one worker per chunk; the pool is closed on exit
        pool.map(dynamodb_upload, split_df)

# Lambda

AWS Lambda lets you run code without provisioning or managing servers, which helps in building a serverless API.

Lambda provides a lightweight, serverless way to serve an API. One downside is that it doesn't come with every Python library, only the base packages and boto (the AWS package). To use any other package, we have to zip it alongside our 'lambda_function.py' file.

In our case we need 'json2html', so we zip the json2html folder alongside our 'lambda_function.py' and upload the archive to Lambda.

Let's build the zip file for our Lambda function.

1) Create a new virtual env with conda: conda create -n my_new_env_name

2) Activate the newly created env: source activate my_new_env_name

3) pip install the package your Lambda needs: pip install json2html

4) Grab the json2html folder. It will usually be in Anaconda's site-packages path, e.g. '/home/shivraj/anaconda3/lib/python3.6/site-packages/json2html'.

5) Zip the json2html folder obtained in step 4 together with your lambda_function.py, which is covered further down.

6) Upload the zip file in the Lambda AWS UI.
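Step 5 can also be scripted. The following is a minimal sketch using Python's standard `zipfile` module, assuming the handler file should sit at the root of the archive and the package should keep its folder name. The function name `build_lambda_zip` and the archive name in the example call are placeholders, not part of the original workflow.

```python
import os
import zipfile


def build_lambda_zip(zip_path, handler_file, package_dir):
    """Bundle the handler file and one package folder into a deployment zip."""
    # Paths inside the archive are made relative to the package's parent folder,
    # so the package keeps its own folder name (e.g. 'json2html/...').
    parent = os.path.dirname(os.path.abspath(package_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # The handler goes at the root of the archive.
        zf.write(handler_file, arcname=os.path.basename(handler_file))
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, arcname=os.path.relpath(full, parent))


# Example call (placeholder archive name; adjust the paths to your environment):
# build_lambda_zip("lambda_package.zip", "lambda_function.py",
#                  "/home/shivraj/anaconda3/lib/python3.6/site-packages/json2html")
```

The resulting archive has the same layout you would get by zipping the two items by hand.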

This is what the Lambda UI looks like:
![lambda_ui](https://i.imgur.com/9KFK665.png)

The upload now contains the required json2html package as a folder that Lambda can import from, along with our main module, lambda_function.py.

#### lambda_function.py 
This Lambda function provides the API interface to our DynamoDB table and returns the requested data based on the query parameters.

In [None]:
import json
from boto3.dynamodb.conditions import Key, Attr
import boto3
from pprint import pformat
from json2html import *

MY_SECRET_ACCESS_KEY = '' 
MY_ACCESS_KEY_ID = ''

def lambda_handler(event, context):
    dynamodb = boto3.resource('dynamodb', aws_access_key_id=MY_ACCESS_KEY_ID, aws_secret_access_key=MY_SECRET_ACCESS_KEY, region_name='us-east-1')
    table = dynamodb.Table('baboons')
    baboon = str(event["queryStringParameters"]['indiv']) # parse api call to get indiv
    d0 = str(event["queryStringParameters"]['d0']) # parse api call to get start time
    dt = str(event["queryStringParameters"]['dt']) # parse api call to get end time
    data_frame_flag = event["queryStringParameters"]['table'].lower() == "true" # parse api call to get table flag
    # query dynamoDB for the parsed filter parameters and return a response.
    response = table.query(KeyConditionExpression=Key('indiv').eq(baboon) & Key('time').between(d0, dt))

    dict_string = pformat(response['Items'])  # pretty-print the items for aesthetics
    # return an HTML table or the plain pretty-printed text, based on the table flag in the API call
    if not data_frame_flag:
        print("Returning JSON")
        return {"statusCode": 200, "body": dict_string}
    else:
        print("Returning HTML")
        return {
            "statusCode": 200,
            "body": json2html.convert(response['Items']),
            "headers": {'Content-Type': 'text/html'},
        }
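To make the parameter handling easier to follow, here is a small standalone sketch of the parsing step. It assumes the event has the shape that API Gateway's Lambda proxy integration sends, with a "queryStringParameters" dict, which is what the handler above reads. The helper `parse_query_params`, the `as_table` key, and the default for a missing 'table' parameter are illustrative choices, not part of the handler.

```python
def parse_query_params(event):
    """Extract the handler's query parameters from an API Gateway style event."""
    # The field can be missing or None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    return {
        "indiv": str(params["indiv"]),
        "d0": str(params["d0"]),
        "dt": str(params["dt"]),
        # Treat a missing 'table' parameter as "false" (an assumption of this sketch).
        "as_table": str(params.get("table", "false")).lower() == "true",
    }


sample_event = {
    "queryStringParameters": {
        "indiv": "1",
        "table": "true",
        "d0": "0:02:52",
        "dt": "0:02:58",
    }
}
print(parse_query_params(sample_event))
```

Keeping the parsing in one place like this makes it possible to test it locally without touching DynamoDB.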

A sample API call would be: https://ezmt8rznkh.execute-api.us-east-1.amazonaws.com/default/baboon?indiv=1&table=true&d0=0:02:52&dt=0:02:58

The parameters are:

indiv=1 (the ID of the baboon to fetch, here the baboon with ID 1)

table=true/false (whether to return the data as an HTML table or as JSON)

d0 = start time (the time from which data is required)

dt = end time (the time up to which data should be queried)
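The call can also be assembled in code. The sketch below rebuilds the sample URL with the standard library; `build_api_url` is a hypothetical helper name, and the base URL is the one from the sample call.

```python
from urllib.parse import urlencode

BASE_URL = "https://ezmt8rznkh.execute-api.us-east-1.amazonaws.com/default/baboon"


def build_api_url(indiv, d0, dt, table=True):
    """Assemble the API URL from the four query parameters."""
    params = {"indiv": indiv, "table": str(table).lower(), "d0": d0, "dt": dt}
    # safe=":" keeps the colons in the time strings instead of percent-encoding them.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


url = build_api_url(1, "0:02:52", "0:02:58")
print(url)
```

Passing `table=False` produces the same URL with `table=false`, which asks for the non-HTML response.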

        